ABSTRACT. We present an efficient exact algorithm for estimating state sequences from 
outputs (or observations) in imprecise hidden Markov models (iHMM), where both the 
uncertainty linking one state to the next, and that linking a state to its output, are represented 
using coherent lower previsions. The notion of independence we associate with the credal 
network representing the iHMM is that of epistemic irrelevance. We consider as best estim- 
ates for state sequences the (Walley-Sen) maximal sequences for the posterior joint state 
model conditioned on the observed output sequence, associated with a gain function that is 
the indicator of the state sequence. This corresponds to (and generalises) finding the state 
sequence with the highest posterior probability in HMMs with precise transition and output 
probabilities (pHMMs). We argue that the computational complexity is at worst quadratic in 
the length of the Markov chain, cubic in the number of states, and essentially linear in the 
number of maximal state sequences. For binary iHMMs, we investigate experimentally how 
the number of maximal state sequences depends on the model parameters. We also present 
a simple toy application in optical character recognition, demonstrating that our algorithm 
can be used to robustify the inferences made by precise probability models. 



1. Introduction 

In Artificial Intelligence, probabilistic graphical models are becoming an increasingly 
powerful tool. Hidden Markov models (HMMs) are among the simplest of these, and 
perhaps also among the more popular ones. 

An important application for HMMs involves finding the sequence of (hidden) states 
with the highest posterior probability after observing a sequence of outputs [12]. For HMMs 
with precise local transition and emission probabilities, there is a quite efficient dynamic 
programming algorithm, due to Viterbi [12, 14], for performing this task. For imprecise- 
probabilistic local models, such as coherent lower previsions, we know of no algorithm in 
the literature for which the computational complexity comes even close to that of Viterbi's. 
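
For reference, the precise-probabilistic baseline mentioned above can be sketched as follows. This is a minimal log-domain Viterbi sketch, not code taken from the paper: the names `viterbi`, `init`, `trans` and `emit` are illustrative assumptions, and all probabilities are assumed to be strictly positive.

```python
import math


def viterbi(init, trans, emit, outputs):
    """Return one most probable state path of a precise HMM.

    init[s], trans[r][s] and emit[s][o] are strictly positive probabilities
    (hypothetical list-of-lists layout, states and outputs indexed from 0).
    """
    states = range(len(init))
    # score[s]: best log-probability of any state path ending in state s
    score = [math.log(init[s]) + math.log(emit[s][outputs[0]]) for s in states]
    back = []
    for o in outputs[1:]:
        prev, score, pointers = score, [], []
        for s in states:
            # max keeps the first best predecessor, so ties are broken arbitrarily
            r = max(states, key=lambda q: prev[q] + math.log(trans[q][s]))
            pointers.append(r)
            score.append(prev[r] + math.log(trans[r][s]) + math.log(emit[s][o]))
        back.append(pointers)
    # backtrack from the best final state
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

Because `max` returns only the first maximiser, this sketch outputs a single path even when several paths are equally probable.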

In this paper, we take the first steps towards remedying this situation. We describe 
imprecise hidden Markov models as special cases of credal trees (a special case of credal 
networks) under epistemic irrelevance in Section 3. We show in particular how we can use 
the ideas underlying the MePiCTIr algorithm [4], involving independent natural extension 
and marginal extension, to construct a most conservative joint model from imprecise local 
transition and emission models. We also derive a number of interesting and useful formulas 
from that construction. 

The results in Section 3 assume basic knowledge of the theory of coherent lower previ- 
sions, a generalisation of classical probability that allows for incomplete specification of 
probabilities. We include a short introduction to this theory in Section 2. 

In Section 4 we explain how a sequence of observations leads to (a collection of) so- 
called maximal state sequences. Finding all of them seems a daunting task at first: it has 
a search space that grows exponentially in the length of the Markov chain. However, in 
Section 5 we use the basic formulas found in Section 3 to derive an appropriate version 
of Bellman's [1] Principle of Optimality, which allows for an exponential reduction of the 
search space. By using a number of additional tricks, we are able in Section 6 to devise the 
EstiHMM algorithm, which efficiently constructs all maximal state sequences. We prove 
in Section 7 that this algorithm is essentially linear in the number of maximal sequences, 
quadratic in the length of the chain, and cubic in the number of states. We perceive this 
complexity to be comparable to that of the Viterbi algorithm, especially after realising that 
the latter makes the simplifying step of resolving ties more or less arbitrarily in order to 
produce only a single optimal state sequence. This is something we will not allow our 
algorithm to do, for reasons that should become clear further on. 

In Section 8, we consider the special case of binary iHMMs, and investigate experi- 
mentally how the number of maximal state sequences depends on the model parameters. 
We comment on the very interesting structures that emerge, and give them an heuristic 
explanation. 

We show off the algorithm's efficiency in Section 9 by calculating the maximal sequences 
for a specific iHMM of length 100. 

We conclude in Section 10 with a simple toy application in optical character recognition. 
It demonstrates the advantages of our algorithm and gives a clear indication that the 
EstiHMM algorithm is able to robustify the existing Viterbi algorithm in an intelligent 
manner. 

In order to make our main argumentation as readable as possible, we have relegated all 
technical proofs to an appendix. 

2. Freshening up on coherent lower previsions 

We begin with some basic theory of coherent lower previsions. See Ref. [15] for an 
in-depth study, and Ref. [9] for a recent survey. 

Coherent lower previsions are a special type of imprecise probability model. Roughly 
speaking, whereas classical probability theory assumes that a subject's uncertainty can be 
represented by a single probability mass function, the theory of imprecise probabilities 
effectively works with sets of possible probability mass functions, and thereby allows for 
imprecision as well as indecision to be modelled and represented. For people who are 
unfamiliar with the theory, looking at it as a way of robustifying the classical theory is 
perhaps the easiest way to understand and interpret it, and we will use this approach here. 

Consider a set $\mathcal{M}$ of probability mass functions, defined on a discrete set $\mathcal{X}$ of possible 
states. With each mass function $p \in \mathcal{M}$, we can associate a linear prevision (or expectation 
operator) $P_p$, defined on the set $\mathcal{L}(\mathcal{X})$ of all real-valued maps on $\mathcal{X}$. Any $f \in \mathcal{L}(\mathcal{X})$ is also 
called a gamble on $\mathcal{X}$, and $P_p(f) := \sum_{x \in \mathcal{X}} p(x) f(x)$ is the expected value of $f$, associated 
with the probability mass function $p$. We can now define the lower prevision $\underline{P}_{\mathcal{M}}$ that 
corresponds with the set $\mathcal{M}$ as the following lower envelope of linear previsions: 

$\underline{P}_{\mathcal{M}}(f) := \inf\{P_p(f) : p \in \mathcal{M}\}$ for all gambles $f$ on $\mathcal{X}$. (1) 

Similarly, we define the upper prevision $\overline{P}_{\mathcal{M}}$ as 

$\overline{P}_{\mathcal{M}}(f) := \sup\{P_p(f) : p \in \mathcal{M}\}$ 
$= -\inf\{-P_p(f) : p \in \mathcal{M}\} = -\inf\{P_p(-f) : p \in \mathcal{M}\} = -\underline{P}_{\mathcal{M}}(-f)$ (2) 

for all gambles $f$ on $\mathcal{X}$. We will mostly talk about lower previsions, since it follows from 
the conjugacy relation (2) that the two models are mathematically equivalent. 
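
The lower and upper envelopes in Equations (1) and (2) can be illustrated with a small sketch. It assumes a finite set of mass functions, so that the infimum and supremum become a minimum and a maximum; the function names and the example numbers are hypothetical.

```python
def expectation(p, f):
    """Linear prevision P_p(f): expected value of gamble f under mass function p."""
    return sum(p[x] * f[x] for x in p)


def lower_prevision(mass_functions, f):
    """Lower envelope of the linear previsions over a finite set of mass functions."""
    return min(expectation(p, f) for p in mass_functions)


def upper_prevision(mass_functions, f):
    """Upper envelope of the linear previsions over a finite set of mass functions."""
    return max(expectation(p, f) for p in mass_functions)


# Hypothetical example on the states {a, b, c}
M = [{"a": 0.5, "b": 0.3, "c": 0.2}, {"a": 0.2, "b": 0.3, "c": 0.5}]
f = {"a": 1.0, "b": 0.0, "c": -1.0}
neg_f = {x: -value for x, value in f.items()}

low, up = lower_prevision(M, f), upper_prevision(M, f)
# Conjugacy: the upper prevision of f equals minus the lower prevision of -f
conjugate_up = -lower_prevision(M, neg_f)
```

Here `up` and `conjugate_up` agree up to rounding, which is the conjugacy relation at work.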

An event $A$ is a subset of the set of possible values $\mathcal{X}$: $A \subseteq \mathcal{X}$. With such an event, we 
can associate an indicator $I_A$, which is the gamble on $\mathcal{X}$ that assumes the value 1 on $A$, and 
0 outside $A$. We call 

$\underline{P}_{\mathcal{M}}(A) := \underline{P}_{\mathcal{M}}(I_A)$ 

the lower probability of the event $A$, and similarly $\overline{P}_{\mathcal{M}}(A) := \overline{P}_{\mathcal{M}}(I_A)$ its upper probability. 

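It can be shown [15] that the functional $\underline{P}_{\mathcal{M}}$ satisfies the following set of interesting 
mathematical properties, which define a coherent lower prevision: 

C1. $\underline{P}_{\mathcal{M}}(f) \geq \min f$ for all $f \in \mathcal{L}(\mathcal{X})$, 

C2. $\underline{P}_{\mathcal{M}}(\lambda f) = \lambda \underline{P}_{\mathcal{M}}(f)$ for all $f \in \mathcal{L}(\mathcal{X})$ and all real $\lambda \geq 0$, [non-negative homogeneity] 

C3. $\underline{P}_{\mathcal{M}}(f + g) \geq \underline{P}_{\mathcal{M}}(f) + \underline{P}_{\mathcal{M}}(g)$ for all $f, g \in \mathcal{L}(\mathcal{X})$. [superadditivity] 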



Every set of mass functions $\mathcal{M}$ uniquely defines a coherent lower prevision $\underline{P}_{\mathcal{M}}$, but in 
general the converse does not hold. However, if we limit ourselves to sets of mass functions 
$\mathcal{M}$ that are closed and convex (which makes them credal sets), they are in a one-to-one 
correspondence with coherent lower previsions [15]. This implies that we can use the theory 
of coherent lower previsions as a tool for reasoning with closed convex sets of probability 
mass functions. From now on, we will no longer explicitly refer to credal sets $\mathcal{M}$, but 
we will simply talk about coherent lower previsions $\underline{P}$. It is useful to keep in mind that 
there always is a unique credal set that corresponds with such a coherent lower prevision: 
$\underline{P} = \underline{P}_{\mathcal{M}}$ for some unique credal set $\mathcal{M}$, given by $\mathcal{M} = \{p : (\forall f \in \mathcal{L}(\mathcal{X}))\, P_p(f) \geq \underline{P}(f)\}$. 

A special kind of imprecise model on $\mathcal{X}$ is the vacuous lower prevision. It is a model that 
represents complete ignorance and therefore has the set of all possible mass functions on 
$\mathcal{X}$ as its credal set $\mathcal{M}$. It can be shown easily that for every $f \in \mathcal{L}(\mathcal{X})$ the corresponding 
lower prevision is given by $\underline{P}(f) = \min f$. 

Conditional lower and upper previsions, which are extensions of the classical conditional 
expectation functionals, can be defined in a similar, intuitively obvious way as lower 
envelopes associated with sets of conditional mass functions. 

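Consider a variable $X$ in $\mathcal{X}$ and a variable $Y$ in $\mathcal{Y}$. A conditional lower prevision $\underline{P}(\cdot|Y)$ 
on the set $\mathcal{L}(\mathcal{X})$ of all gambles on $\mathcal{X}$ is a two-place real-valued function. For any gamble $f$ 
on $\mathcal{X}$, $\underline{P}(f|Y)$ is a gamble on $\mathcal{Y}$, whose value $\underline{P}(f|y)$ in $y \in \mathcal{Y}$ is the lower prevision of $f$, 
conditional on the event $Y = y$. If for any $y \in \mathcal{Y}$, the lower prevision $\underline{P}(\cdot|y)$ is coherent 
(satisfies conditions C1-C3), then we call the conditional lower prevision $\underline{P}(\cdot|Y)$ separately 
coherent. It will sometimes be useful to extend the domain of the conditional lower prevision 
$\underline{P}(\cdot|y)$ from $\mathcal{L}(\mathcal{X})$ to $\mathcal{L}(\mathcal{X} \times \mathcal{Y})$ by letting $\underline{P}(f|y) := \underline{P}(f(\cdot, y)|y)$ for all gambles $f$ on 
$\mathcal{X} \times \mathcal{Y}$. 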



If we have a number of conditional lower previsions involving a number of variables, 
then each of them must be separately coherent, but we also have to make sure that they 
satisfy a more stringent joint coherence requirement. Explaining this in detail would take us 
too far, but we refer to Ref. [15] for a detailed discussion, with motivation. For our present 
purposes, it suffices to say that joint coherence is very closely related to making sure that 
these conditional lower previsions are lower envelopes associated with conditional mass 
functions that satisfy Bayes's Rule. 

For a given lower prevision $\underline{P}$ on $\mathcal{L}(\mathcal{X} \times \mathcal{Y})$, a corresponding conditional lower prevision 
$\underline{P}(\cdot|Y)$ that is jointly coherent with $\underline{P}$ is not uniquely defined. It is however shown in Ref. [10] 
that it always lies between the so-called natural and regular extensions. 

Using natural extension, the conditional coherent lower prevision $\underline{P}(\cdot|Y)$ is defined 
by $\underline{P}(f|y) := \max\{\mu \in \mathbb{R} : \underline{P}(I_{\{y\}}[f - \mu]) \geq 0\}$ if $\underline{P}(\{y\}) > 0$, and it is vacuous and thus 
given by $\underline{P}(f|y) := \min f$ if $\underline{P}(\{y\}) = 0$. This is the smallest (most conservative) way 
of conditioning a lower prevision. If $\underline{P}(\{y\}) > 0$, it corresponds to conditioning every 
probability mass function in the credal set of $\underline{P}$ on the observation that $Y = y$ and taking the 
lower envelope of all these conditioned mass functions. 



Using regular extension, the conditional coherent lower prevision $\underline{P}(\cdot|Y)$ is defined by 
$\underline{P}(f|y) := \max\{\mu \in \mathbb{R} : \underline{P}(I_{\{y\}}[f - \mu]) \geq 0\}$ if $\overline{P}(\{y\}) > 0$, and it is vacuous if $\overline{P}(\{y\}) = 0$. 
This gives us the greatest (most informative) conditional lower prevision that is jointly 
coherent with the original unconditional lower prevision. It corresponds to taking all mass 
functions $p$ in the credal set of $\underline{P}$ for which $p(y) \neq 0$, conditioning them on the observation 
that $Y = y$ and taking their lower envelope. 

Natural and regular extension coincide if $\underline{P}(\{y\}) > 0$ or $\overline{P}(\{y\}) = 0$, but are different if 
$\overline{P}(\{y\}) > \underline{P}(\{y\}) = 0$. In the latter case, natural extension is vacuous, but regular extension 
usually remains more informative. 
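
The difference between the two extensions can be made concrete with a small sketch. It assumes that the credal set is described by a finite list of joint mass functions (its extreme points), stored as dictionaries keyed by `(x, y)` pairs; all names and numbers below are hypothetical.

```python
def prob_y(p, y):
    """Marginal probability p(y) of a joint mass function p[(x, y)]."""
    return sum(v for (_, yy), v in p.items() if yy == y)


def conditional_expectation(p, f, y):
    """Expectation of the gamble f on X under p conditioned on Y = y (needs p(y) > 0)."""
    return sum(v * f[x] for (x, yy), v in p.items() if yy == y) / prob_y(p, y)


def natural_extension(M, f, y):
    """Condition every mass function if the lower probability of y is positive, else vacuous."""
    if min(prob_y(p, y) for p in M) > 0:
        return min(conditional_expectation(p, f, y) for p in M)
    return min(f.values())


def regular_extension(M, f, y):
    """Condition only the mass functions with p(y) > 0; vacuous if there are none."""
    usable = [p for p in M if prob_y(p, y) > 0]
    if not usable:
        return min(f.values())
    return min(conditional_expectation(p, f, y) for p in usable)


# Lower probability of "y" is 0, upper probability is 0.4
M = [
    {(0, "y"): 0.2, (1, "y"): 0.2, (0, "n"): 0.3, (1, "n"): 0.3},
    {(0, "y"): 0.0, (1, "y"): 0.0, (0, "n"): 0.5, (1, "n"): 0.5},
]
f = {0: 0.0, 1: 1.0}
natural = natural_extension(M, f, "y")   # vacuous: min f
regular = regular_extension(M, f, "y")   # conditions the first mass function only
```

In this example the lower probability of the conditioning event is zero while its upper probability is positive, which is exactly the case where the two extensions differ.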

In this introduction, coherent lower previsions were interpreted as an alternative repres- 
entation for closed and convex sets of probability mass functions. This approach is often 
adopted by sensitivity analysts and is rather intuitive for people who are used to working in 
classical probability theory. For the sake of completeness, we mention here that coherent 
lower previsions can also be given a behavioural interpretation, without using the notion 
of a probability mass function. The lower prevision $\underline{P}(f)$ of a gamble $f \in \mathcal{L}(\mathcal{X})$ can be 
interpreted as the supremum acceptable buying price that a subject is willing to pay in order 
to gain the (possibly negative) reward $f(x)$ after the outcome $x \in \mathcal{X}$ of the experiment has 
been determined. See Ref. [15] for more information regarding this interpretation. 



3. Basic notions 



An imprecise hidden Markov model can be depicted using the following probabilistic 
graphical model: 



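[Figure: a chain of state variables $X_1 \to X_2 \to \cdots \to X_n$ (state sequence) with local models $\underline{Q}_1(\cdot)$, $\underline{Q}_2(\cdot|X_1)$, ..., $\underline{Q}_n(\cdot|X_{n-1})$; each state $X_k$ has an output child $O_k$ (output sequence) with local model $\underline{S}_k(\cdot|X_k)$.] 

Figure 1. Tree representation of a hidden Markov model 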

Here $n$ is some natural number. The state variables $X_1, \ldots, X_n$ assume values in the 
respective finite sets $\mathcal{X}_1, \ldots, \mathcal{X}_n$, and the output variables $O_1, \ldots, O_n$ assume values in the 
respective finite sets $\mathcal{O}_1, \ldots, \mathcal{O}_n$. We denote generic values of $X_k$ by $x_k$, $\hat{x}_k$ or $z_k$, and generic 
values of $O_k$ by $o_k$. 



3.1. Local uncertainty models. We assume that we have the following local uncertainty 
models for these variables. For $X_1$, we have a marginal lower prevision $\underline{Q}_1$, defined on the 
set $\mathcal{L}(\mathcal{X}_1)$ of all real-valued maps (or gambles) on $\mathcal{X}_1$. For the subsequent states $X_k$, with 
$k \in \{2, \ldots, n\}$, we have a conditional lower prevision $\underline{Q}_k(\cdot|X_{k-1})$ defined on $\mathcal{L}(\mathcal{X}_k)$, called 
a transition model. In order to maintain uniformity of notation, we will also denote the 
marginal lower prevision $\underline{Q}_1$ as a conditional lower prevision $\underline{Q}_1(\cdot|X_0)$, where $X_0$ denotes 
a variable that may only assume a single value, and whose value is therefore certain. For 
any gamble $f_k$ in $\mathcal{L}(\mathcal{X}_k)$, $\underline{Q}_k(f_k|X_{k-1})$ is interpreted as a gamble on $\mathcal{X}_{k-1}$, whose value 
$\underline{Q}_k(f_k|z_{k-1})$ in any $z_{k-1} \in \mathcal{X}_{k-1}$ is the lower prevision of the gamble $f_k(X_k)$, conditional on 
$X_{k-1} = z_{k-1}$. 

In addition, for each output $O_k$, with $k \in \{1, \ldots, n\}$, we have a conditional lower prevision 
$\underline{S}_k(\cdot|X_k)$ defined on $\mathcal{L}(\mathcal{O}_k)$, called an emission model. For any gamble $g_k$ in $\mathcal{L}(\mathcal{O}_k)$, $\underline{S}_k(g_k|X_k)$ 
is interpreted as a gamble on $\mathcal{X}_k$, whose value $\underline{S}_k(g_k|z_k)$ in any $z_k \in \mathcal{X}_k$ is the lower prevision 
of the gamble $g_k(O_k)$, conditional on $X_k = z_k$. 
We take all these local (marginal, transition and emission) uncertainty models to be 
separately coherent. Recall that this simply means that for any $k \in \{1, \ldots, n\}$, the lower 
prevision $\underline{Q}_k(\cdot|z_{k-1})$ should be coherent (as an unconditional lower prevision) for every 
$z_{k-1} \in \mathcal{X}_{k-1}$ and $\underline{S}_k(\cdot|z_k)$ should be coherent for every $z_k \in \mathcal{X}_k$. 

3.2. Interpretation of the graphical structure. We will assume that the graphical repres- 
entation in Figure 1 represents the following irrelevance assessments: conditional on its 
mother variable, the non-parent non-descendants of any variable in the tree are epistemic- 
ally irrelevant to this variable and its descendants. We say that a variable X is epistemically 
irrelevant to a variable Y if observing X does not affect our beliefs about Y. Mathematically 
stated in terms of lower previsions: $\underline{P}(f(Y)) = \underline{P}(f(Y)|x)$ for all $f \in \mathcal{L}(\mathcal{Y})$ and all $x \in \mathcal{X}$. 

Before we go on, it will be useful to introduce some mathematical short-hand notation 
for describing joint variables in the tree of Figure 1. For any $1 \leq k \leq \ell \leq n$, we denote 
the tuple $(X_k, X_{k+1}, \ldots, X_\ell)$ by $X_{k:\ell}$, and the tuple $(O_k, O_{k+1}, \ldots, O_\ell)$ by $O_{k:\ell}$. $X_{k:\ell}$ is a (joint) 
variable that can assume all values in the set $\mathcal{X}_{k:\ell} := \times_{r=k}^{\ell} \mathcal{X}_r$, and $O_{k:\ell}$ is a (joint) variable 
that can assume all values in the set $\mathcal{O}_{k:\ell} := \times_{r=k}^{\ell} \mathcal{O}_r$. Generic values of $X_{k:\ell}$ are denoted by 
$x_{k:\ell}$ or $\hat{x}_{k:\ell}$, and generic values of $O_{k:\ell}$ by $o_{k:\ell}$. 

Example 1. Consider the variable $X_k$ with mother variable $X_{k-1}$ in Figure 1. The variables 
$X_{1:k-2}$ and $O_{1:k-1}$ are its non-parent non-descendants, and the variables $X_{k+1:n}$ and $O_{k:n}$ its 
descendants. Our interpretation of the graphical structure of Figure 1 implies that once we 
know (conditional on) the value $x_{k-1}$ of $X_{k-1}$, additionally learning the values of any of the 
variables $X_1, \ldots, X_{k-2}$ and $O_1, \ldots, O_{k-1}$ will not change our beliefs about $X_{k:n}$ and $O_{k:n}$. ♦ 

Epistemic irrelevance is weaker than the so-called strong independence condition that is 
usually associated with credal networks [2], which is the name usually given to probabilistic 
graphical models with coherent lower previsions as local uncertainty models. Recent work 
[4] has shown that using this weaker condition guarantees that an efficient algorithm exists 
for updating beliefs about a single target node of a credal tree, that is essentially linear in 
the number of nodes in the tree. 

3.3. A joint uncertainty model. Using the local uncertainty models, we now want to 
construct a global model: a joint lower prevision $\underline{P}$ on $\mathcal{L}(\mathcal{X}_{1:n} \times \mathcal{O}_{1:n})$ for all the variables 
$(X_{1:n}, O_{1:n})$ in the tree. This joint lower prevision should (i) be jointly coherent with all the 
local models; (ii) encode all epistemic irrelevance assessments encoded in the tree; and (iii) 
be as small, or conservative,$^3$ as possible. This is a special case of a more general problem for 
credal trees, discussed and solved in great detail in Ref. [4]. In this section, we summarise 
the solution for iHMMs and give an heuristic justification for it, but we refer to Ref. [4] 
for a proof that the joint model we present below is indeed the most conservative lower 
prevision that is coherent with all the local models and captures all epistemic irrelevance 
assessments encoded in the tree. 

We proceed in a recursive manner, and consider any $k \in \{1, \ldots, n\}$. For any $x_{k-1} \in \mathcal{X}_{k-1}$, 
we consider the smallest coherent joint lower prevision $\underline{P}_k(\cdot|x_{k-1})$ on $\mathcal{L}(\mathcal{X}_{k:n} \times \mathcal{O}_{k:n})$ for 
the variables $(X_{k:n}, O_{k:n})$ in the iHMM depicted in Figure 2, representing a subtree of the 
tree represented in Figure 1, with the lower prevision $\underline{Q}_k(\cdot|x_{k-1})$ acting as the marginal 
model for the 'first' state variable $X_k$. Note that the global model $\underline{P}$ we are looking for 
can be identified with the conditional lower prevision $\underline{P}_1(\cdot|X_0)$, for the reasons given in 
Section 3.1. 

Our aim is to develop recursive expressions that enable us to construct $\underline{P}_k(\cdot|X_{k-1})$ out 
of $\underline{P}_{k+1}(\cdot|X_k)$. Using these expressions over and over again will eventually yield the global 
model $\underline{P} = \underline{P}_1(\cdot|X_0)$. 

$^3$ Recall that point-wise smaller lower previsions correspond to larger credal sets. 

[Figure: the subtree of Figure 1 rooted at $X_k$, with $\underline{Q}_k(\cdot|x_{k-1})$ as the local model for $X_k$.] 

Figure 2. Subtree of the iHMM involving the variables $(X_{k:n}, O_{k:n})$ 

In a first step, we combine the joint model $\underline{P}_{k+1}(\cdot|X_k)$ for the variables $(X_{k+1:n}, O_{k+1:n})$, 
defined on $\mathcal{L}(\mathcal{X}_{k+1:n} \times \mathcal{O}_{k+1:n})$ (see the thick dotted lines in Figure 2), with the local 
model $\underline{S}_k(\cdot|X_k)$ for the variable $O_k$, defined on $\mathcal{L}(\mathcal{O}_k)$. This will lead to a joint model 
$\underline{E}_k(\cdot|X_k)$ for the variables $(X_{k+1:n}, O_{k:n})$, defined on $\mathcal{L}(\mathcal{X}_{k+1:n} \times \mathcal{O}_{k:n})$ (see the semi-thick 
dotted lines in Figure 2). This is trivial for $k = n$, since we must have that $\underline{E}_n(\cdot|X_n) = \underline{S}_n(\cdot|X_n)$. 

For $k \neq n$, the solution is less obvious. A joint model can be constructed in many different 
ways, so we will have to impose some conditions. A first condition is that $\underline{E}_k(\cdot|X_k)$ should be 
a separately coherent conditional lower prevision that is jointly coherent with the 'marginal' 
models $\underline{P}_{k+1}(\cdot|X_k)$ and $\underline{S}_k(\cdot|X_k)$. A second, rather obvious, condition is that $\underline{E}_k(\cdot|X_k)$ should 
coincide with $\underline{P}_{k+1}(\cdot|X_k)$ and $\underline{S}_k(\cdot|X_k)$ on their respective domains. A third condition is 
that the model should capture the epistemic irrelevance assessments encoded in the tree. 
In particular these state that, conditional on $X_k$, the two variables $(X_{k+1:n}, O_{k+1:n})$ and $O_k$ 
should be epistemically independent, or in other words, epistemically irrelevant to one 
another. 

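Any model that meets all these conditions is called a (conditionally) independent product 
[5] of $\underline{P}_{k+1}(\cdot|X_k)$ and $\underline{S}_k(\cdot|X_k)$. Generally speaking, such a (conditionally) independent 
product is not unique. We call the point-wise smallest, most conservative, of all possible 
(conditionally) independent products, which always exists, the (conditionally) independent 
natural extension [15, 5] of $\underline{P}_{k+1}(\cdot|X_k)$ and $\underline{S}_k(\cdot|X_k)$, and we denote it as 
$\underline{P}_{k+1}(\cdot|X_k) \otimes \underline{S}_k(\cdot|X_k)$. 

Summarising, $\underline{E}_k(\cdot|X_k)$ is given by 

$\underline{E}_k(\cdot|X_k) := \begin{cases} \underline{S}_n(\cdot|X_n) & k = n \\ \underline{P}_{k+1}(\cdot|X_k) \otimes \underline{S}_k(\cdot|X_k) & k = n-1, \ldots, 1 \end{cases}$ (3) 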



The (conditionally) independent natural extension and its properties were studied in 
great detail in Ref. [5]. For the purposes of this paper, it will suffice to recall from that 
study that, very much like independent products of precise probability models, such 
independent natural extensions are factorising, which implies in particular that 

$\underline{E}_k(fg|z_k) = \underline{E}_k(g\,\underline{E}_k(f|z_k)|z_k)$ 
$= \underline{S}_k(g\,\underline{P}_{k+1}(f|z_k)|z_k)$ 
$= \begin{cases} \underline{S}_k(g|z_k)\,\underline{P}_{k+1}(f|z_k) & \text{if } \underline{P}_{k+1}(f|z_k) \geq 0 \\ \overline{S}_k(g|z_k)\,\underline{P}_{k+1}(f|z_k) & \text{if } \underline{P}_{k+1}(f|z_k) \leq 0 \end{cases}$ 
$= \underline{\overline{S}}_k(g|z_k) \odot \underline{P}_{k+1}(f|z_k)$, (4) 

for all $z_k \in \mathcal{X}_k$, all $f \in \mathcal{L}(\mathcal{X}_{k+1:n} \times \mathcal{O}_{k+1:n})$ and all non-negative $g \in \mathcal{L}(\mathcal{O}_k)$; we call a 
gamble non-negative if all its values are. In this expression, the first equality is the actual 
factorisation property. The second equality holds because $\underline{E}_k(\cdot|X_k)$ coincides with $\underline{P}_{k+1}(\cdot|X_k)$ 
and $\underline{S}_k(\cdot|X_k)$ on their respective domains. The third equality follows from the conjugacy 
relation (2) and coherence condition C2, and for the fourth we have used the shorthand 
notation $\underline{\overline{m}} \odot x := \underline{m}\max\{0,x\} + \overline{m}\min\{0,x\}$. Further on, we will also use the analogous 
notation $\overline{\underline{m}} \odot x := \overline{m}\max\{0,x\} + \underline{m}\min\{0,x\}$. 
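
The two shorthand products can be written as small helper functions. This sketch assumes the reading in which the first shorthand pairs the lower value with the positive part of $x$ and the upper value with its negative part, and the second shorthand swaps the two; the function names are hypothetical.

```python
def odot_lower(m_low, m_up, x):
    """Lower-type product: m_low * max(0, x) + m_up * min(0, x)."""
    return m_low * max(0.0, x) + m_up * min(0.0, x)


def odot_upper(m_low, m_up, x):
    """Upper-type product: m_up * max(0, x) + m_low * min(0, x)."""
    return m_up * max(0.0, x) + m_low * min(0.0, x)


# With 0 <= m_low <= m_up, the lower-type product never exceeds the upper-type one
examples = [(odot_lower(0.2, 0.5, x), odot_upper(0.2, 0.5, x)) for x in (-2.0, 0.0, 2.0)]
```

For a non-negative multiplier interval, the first function gives the smallest and the second the largest value of $m x$ over $m$ in that interval, which is how the third and fourth expressions in Equation (4) are related.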

In a second and final step, we combine the joint model $\underline{E}_k(\cdot|X_k)$ for the variables 
$(X_{k+1:n}, O_{k:n})$, defined on $\mathcal{L}(\mathcal{X}_{k+1:n} \times \mathcal{O}_{k:n})$, with the local model $\underline{Q}_k(\cdot|X_{k-1})$ for the variable 
$X_k$, defined on $\mathcal{L}(\mathcal{X}_k)$, into the joint model $\underline{P}_k(\cdot|X_{k-1})$ for the variables $(X_{k:n}, O_{k:n})$, 
defined on $\mathcal{L}(\mathcal{X}_{k:n} \times \mathcal{O}_{k:n})$. It has been shown elsewhere [15, 11] that the most conservative 
coherent way of doing this, is by means of marginal extension, also known as the law of 
iterated (lower) expectations. This leads to $\underline{P}_k(\cdot|x_{k-1}) := \underline{Q}_k(\underline{E}_k(\cdot|X_k)|x_{k-1})$, or, if we now 
allow $x_{k-1}$ to range over $\mathcal{X}_{k-1}$: 

$\underline{P}_k(\cdot|X_{k-1}) := \underline{Q}_k(\underline{E}_k(\cdot|X_k)|X_{k-1})$. (5) 

For practical purposes, it is useful to see that this is equivalent with 

$\underline{P}_k(f|X_{k-1}) = \underline{Q}_k\Big(\sum_{z_k \in \mathcal{X}_k} I_{\{z_k\}}\, \underline{E}_k(f(z_k, X_{k+1:n}, O_{k:n})|z_k) \Big| X_{k-1}\Big)$ 

for all $f \in \mathcal{L}(\mathcal{X}_{k:n} \times \mathcal{O}_{k:n})$. Recall that in this expression, the indicator $I_{\{z_k\}}$ is a gamble on 
$\mathcal{X}_k$ that assumes the value 1 if $X_k = z_k$ and 0 if $X_k \neq z_k$. 

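3.4. Interesting lower and upper probabilities. Without too much trouble,$^4$ we can use 
Equations (3)-(5) to derive the following expressions for a number of interesting lower and 
upper probabilities: 

$\underline{P}_k(\{o_{k:n}\} \times \{z_{k:n}\}|z_{k-1}) = \prod_{i=k}^{n} \underline{S}_i(\{o_i\}|z_i)\,\underline{Q}_i(\{z_i\}|z_{i-1})$ (6) 

$\overline{P}_k(\{o_{k:n}\} \times \{z_{k:n}\}|z_{k-1}) = \prod_{i=k}^{n} \overline{S}_i(\{o_i\}|z_i)\,\overline{Q}_i(\{z_i\}|z_{i-1})$ (7) 

for all $z_{k-1} \in \mathcal{X}_{k-1}$, $z_{k:n} \in \mathcal{X}_{k:n}$, $o_{k:n} \in \mathcal{O}_{k:n}$ and $k \in \{1, \ldots, n\}$, and 

$\underline{E}_k(\{o_{k:n}\} \times \{z_{k+1:n}\}|z_k) = \underline{S}_k(\{o_k\}|z_k) \prod_{i=k+1}^{n} \underline{S}_i(\{o_i\}|z_i)\,\underline{Q}_i(\{z_i\}|z_{i-1})$ (8) 

$\overline{E}_k(\{o_{k:n}\} \times \{z_{k+1:n}\}|z_k) = \overline{S}_k(\{o_k\}|z_k) \prod_{i=k+1}^{n} \overline{S}_i(\{o_i\}|z_i)\,\overline{Q}_i(\{z_i\}|z_{i-1})$ (9) 

for all $z_k \in \mathcal{X}_k$, $z_{k+1:n} \in \mathcal{X}_{k+1:n}$, $o_{k:n} \in \mathcal{O}_{k:n}$ and $k \in \{1, \ldots, n\}$. We will assume throughout 
that 

$\overline{P}(\{z_{1:n}\} \times \{o_{1:n}\}) > 0$ for all $z_{1:n} \in \mathcal{X}_{1:n}$ and $o_{1:n} \in \mathcal{O}_{1:n}$ 

or equivalently, that all local upper previsions are positive, in the sense that [4]: 

$\overline{Q}_k(\{z_k\}|z_{k-1}) > 0$ and $\overline{S}_k(\{o_k\}|z_k) > 0$ 
for all $z_{k-1} \in \mathcal{X}_{k-1}$, $z_k \in \mathcal{X}_k$, $o_k \in \mathcal{O}_k$ and $k \in \{1, \ldots, n\}$. (10) 

$^4$ As an example, we derive Equations (6) and (7) in Appendix A. 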

This assumption is very weak and not at all restrictive for practical purposes. The imprecise- 
probabilistic local models are usually constructed by adding some margin of error around a 
precise model, thereby making all upper transition probabilities positive by construction. 
We will however allow lower transition probabilities to be zero, which is something that 
does happen often in practical problems. 
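
The product formulas (6) and (7) for $k = 1$, together with the positivity assumption (10), can be sketched as follows. The sketch assumes one particular way of adding a margin of error around a precise model, namely an ε-contamination (linear-vacuous mixture); this choice, the dictionary layout and all names are assumptions made for illustration only.

```python
def contaminate(p, eps):
    """Lower and upper singleton probabilities of an eps-contaminated mass function p.

    Valid for singletons in a space with at least two elements (assumed model).
    """
    lower = {x: (1 - eps) * v for x, v in p.items()}
    upper = {x: (1 - eps) * v + eps for x, v in p.items()}
    return lower, upper


def joint_bounds(init, trans, emit, states, outputs, eps):
    """Lower and upper probability of a (state sequence, output sequence) pair.

    Both are products of local lower, respectively upper, singleton probabilities.
    """
    low = up = 1.0
    prev = None
    for z, o in zip(states, outputs):
        q_low, q_up = contaminate(init if prev is None else trans[prev], eps)
        s_low, s_up = contaminate(emit[z], eps)
        low *= s_low[o] * q_low[z]
        up *= s_up[o] * q_up[z]
        prev = z
    return low, up


# Hypothetical binary model in which the precise transition 0 -> 1 has probability zero
init = {0: 0.5, 1: 0.5}
trans = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.5, 1: 0.5}}
emit = {0: {"a": 0.9, "b": 0.1}, 1: {"a": 0.1, "b": 0.9}}
low, up = joint_bounds(init, trans, emit, states=[0, 1], outputs=["a", "b"], eps=0.1)
```

With any ε > 0, every local upper probability is positive, so the joint upper probability is positive even though the joint lower probability is zero here.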

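Proposition 1. The assumption (10) that all local upper previsions are positive implies that 
$\overline{P}_k(\{o_{k:n}\}|z_{k-1}) > 0$ and $\overline{E}_k(\{o_{k:n}\}|z_k) > 0$ for all $k \in \{1, \ldots, n\}$, $z_k \in \mathcal{X}_k$, $z_{k-1} \in \mathcal{X}_{k-1}$ and 
$o_{k:n} \in \mathcal{O}_{k:n}$. 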

4. Estimating states from outputs 

In a hidden Markov model, the states are not directly observable, but the outputs are, 
and the general aim is to use the outputs to estimate the states. We concentrate on the 
following problem: Suppose we have observed the output sequence $o_{1:n}$, estimate the state 
sequence $x_{1:n}$. We will use an essentially Bayesian approach to do so, but need to allow for 
the fact that we are working with imprecise rather than precise probability models. 

4.1. Updating the iHMM. The first step in our approach consists in updating (or conditioning) 
the joint model $\underline{P} := \underline{P}_1(\cdot|X_0)$ on the observed outputs $O_{1:n} = o_{1:n}$. As mentioned 
in Section 2, there is no unique coherent way to perform this updating. However, for the 
particular problem we are solving in this paper, it so happens that it makes no difference 
which updating method is used, as long as it is coherent. For the time being, we choose to 
use the least conservative$^5$ (most informative) coherent updating method, which is regular 
extension. Later on in Section 4.2, we will show that any other coherent updating method 
yields the same results. 

Since it follows from the positivity assumption (10) and Proposition 1 that $\overline{P}(\{o_{1:n}\}) > 0$, 
regular extension leads us to consider the updated lower prevision $\underline{P}(\cdot|o_{1:n})$ on $\mathcal{L}(\mathcal{X}_{1:n})$, 
given by: 

$\underline{P}(f|o_{1:n}) := \max\{\mu \in \mathbb{R} : \underline{P}(I_{\{o_{1:n}\}}[f - \mu]) \geq 0\}$ for all gambles $f$ on $\mathcal{X}_{1:n}$. (11) 

Using the coherence of the joint lower prevision $\underline{P}$, it is not hard to prove that when 
$\underline{P}(\{o_{1:n}\}) > 0$, $\underline{P}(I_{\{o_{1:n}\}}[f - \mu])$ is a strictly decreasing and continuous function of $\mu$, 
which therefore has a unique zero (see Lemma 7(i)&(iii) in Appendix A). As a consequence, 
we have for any $f \in \mathcal{L}(\mathcal{X}_{1:n})$ that 

$\underline{P}(f|o_{1:n}) \leq 0 \Leftrightarrow (\forall \mu > 0)\, \underline{P}(I_{\{o_{1:n}\}}[f - \mu]) < 0 \Leftrightarrow \underline{P}(I_{\{o_{1:n}\}} f) \leq 0$. (12) 

In fact, it is not hard to infer from the strictly decreasing and continuous character of 
$\underline{P}(I_{\{o_{1:n}\}}[f - \mu])$ that $\underline{P}(f|o_{1:n})$ and $\underline{P}(I_{\{o_{1:n}\}} f)$ have the same sign. They are either both 
negative, both positive or both equal to zero; see also the illustration below. 









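[Illustration: graph of $\underline{P}(I_{\{o_{1:n}\}}[f - \mu])$ as a strictly decreasing function of $\mu$; it takes the value $\underline{P}(I_{\{o_{1:n}\}} f)$ at $\mu = 0$ and crosses zero at $\mu = \underline{P}(f|o_{1:n})$.] 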
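
The characterisation of the updated lower prevision as the zero of a decreasing function can be checked numerically with a small sketch. It assumes that the joint credal set is given by a finite list of joint mass functions keyed by `(x, o)` pairs, each assigning positive probability to the observation; the names and numbers are hypothetical, and the bisection is only a convenient way to locate the zero.

```python
def h(M, f, o, mu):
    """Lower prevision of I_{o}[f - mu] over a finite list M of joint mass functions."""
    return min(
        sum(v * (f[x] - mu) for (x, oo), v in p.items() if oo == o) for p in M
    )


def updated_lower_prevision(M, f, o, tol=1e-12):
    """Largest mu with h(mu) >= 0, located by bisection on [min f, max f]."""
    lo, hi = min(f.values()), max(f.values())  # h(lo) >= 0 >= h(hi)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if h(M, f, o, mid) >= 0:
            lo = mid
        else:
            hi = mid
    return lo


M = [
    {(0, "a"): 0.3, (1, "a"): 0.1, (0, "b"): 0.2, (1, "b"): 0.4},
    {(0, "a"): 0.2, (1, "a"): 0.2, (0, "b"): 0.3, (1, "b"): 0.3},
]
f = {0: -1.0, 1: 1.0}  # indicator of x = 1 minus indicator of x = 0
posterior = updated_lower_prevision(M, f, "a")
joint = h(M, f, "a", 0.0)
```

In this example both `posterior` and `joint` are negative, in line with the statement that the updated lower prevision and the joint lower prevision of the restricted gamble share the same sign.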



$^5$ The most conservative coherent way yields a vacuous model. 
Equation (12) will be of crucial importance further on. However, in general, we want to allow 
$\underline{P}(\{o_{1:n}\})$ to be zero (because this may happen if you allow lower transition probabilities 
to be zero), while requiring that $\overline{P}(\{o_{1:n}\}) > 0$ (because this follows from the positivity 
assumption (10) and Proposition 1). This will, generally speaking, invalidate the second 
equivalence in Equation (12): it turns into an implication only. But, if we limit ourselves to 
the specific type of gambles on $\mathcal{X}_{1:n}$ of the form $f = I_{\{x_{1:n}\}} - I_{\{\hat{x}_{1:n}\}}$, we can still prove the 
following important theorem. 

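Theorem 2. If all local upper previsions are positive, then $\underline{P}(I_{\{o_{1:n}\}}[I_{\{x_{1:n}\}} - I_{\{\hat{x}_{1:n}\}}])$ and 
$\underline{P}(I_{\{x_{1:n}\}} - I_{\{\hat{x}_{1:n}\}}|o_{1:n})$ have the same sign for all fixed values of $x_{1:n}, \hat{x}_{1:n} \in \mathcal{X}_{1:n}$ and 
$o_{1:n} \in \mathcal{O}_{1:n}$. They are both positive, both negative or both zero. 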

4.2. Maximal state sequences. The next step now consists in using the posterior model 
$\underline{P}(\cdot|o_{1:n})$ to find best estimates for the state sequence $x_{1:n}$. On the Bayesian approach, this is 
usually done by solving a decision-making, or optimisation problem: we associate a gain 
function $I_{\{x_{1:n}\}}$ with every candidate state sequence $x_{1:n}$, and select as best estimates those 
state sequences $\hat{x}_{1:n}$ that maximise the posterior expected gain, resulting in state sequences 
with maximal posterior probability. 

Here we generalise this decision-making approach towards working with imprecise 
probability models. The criterion we use to decide which estimates are optimal for the given 
gain functions is that of (Walley-Sen) maximality [13, 15]. Maximality has a number of 
very desirable properties that make sure it works well in optimisation contexts [6, 8], and it 
is well-justified from a behavioural point of view, as well as in a robustness approach, as we 
shall see presently. 

We can express a strict preference $\succ$ between two state sequence estimates $x_{1:n}$ and $\hat{x}_{1:n}$ 
as follows: 

$x_{1:n} \succ \hat{x}_{1:n} \Leftrightarrow \underline{P}(I_{\{x_{1:n}\}} - I_{\{\hat{x}_{1:n}\}}|o_{1:n}) > 0$. 

On a behavioural interpretation, this expresses that a subject with lower prevision $\underline{P}(\cdot|o_{1:n})$ 
is disposed to pay some strictly positive amount of utility to replace the (gain associated with 
the) estimate $\hat{x}_{1:n}$ with the (gain associated with the) estimate $x_{1:n}$; see Ref. [15, Section 3.9] 
for more details. Alternatively, from a robustness point of view, this expresses that for 
each conditional mass function $p(\cdot|o_{1:n})$ in the credal set associated with the updated lower 
prevision $\underline{P}(\cdot|o_{1:n})$, the state sequence $x_{1:n}$ has a posterior probability $p(x_{1:n}|o_{1:n})$ that is 
strictly higher than the posterior probability $p(\hat{x}_{1:n}|o_{1:n})$ of the state sequence $\hat{x}_{1:n}$. 

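The binary relation $\succ$ thus defined is a strict partial order [an irreflexive and transitive 
binary relation] on the set of state sequences $\mathcal{X}_{1:n}$, and we consider an estimate $\hat{x}_{1:n}$ to be 
optimal when it is undominated, or maximal, in this strict partial order: 

$\hat{x}_{1:n} \in \mathrm{opt}(\mathcal{X}_{1:n}|o_{1:n}) \Leftrightarrow (\forall x_{1:n} \in \mathcal{X}_{1:n})\, x_{1:n} \not\succ \hat{x}_{1:n}$ 
$\Leftrightarrow (\forall x_{1:n} \in \mathcal{X}_{1:n})\, \underline{P}(I_{\{x_{1:n}\}} - I_{\{\hat{x}_{1:n}\}}|o_{1:n}) \leq 0$ 
$\Leftrightarrow (\forall x_{1:n} \in \mathcal{X}_{1:n})\, \underline{P}(I_{\{o_{1:n}\}}[I_{\{x_{1:n}\}} - I_{\{\hat{x}_{1:n}\}}]) \leq 0$, (13) 

where the very useful last equivalence follows from Theorem 2. In summary then, the aim 
of this paper is to develop an efficient algorithm for finding the set of maximal estimates 
$\mathrm{opt}(\mathcal{X}_{1:n}|o_{1:n})$. 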
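
The maximality criterion itself can be illustrated by brute force on a tiny example. This is not the efficient algorithm developed in the paper: it assumes that the posterior credal set is available as a finite list of posterior mass functions over all state sequences, which is exponential in the sequence length, and all names and numbers are hypothetical.

```python
from itertools import product


def maximal_sequences(posteriors, sequences):
    """Return the sequences that no other sequence dominates under maximality."""

    def dominates(x, x_hat):
        # The lower prevision of I_x - I_x_hat is positive exactly when
        # every listed posterior ranks x strictly above x_hat.
        return min(p[x] - p[x_hat] for p in posteriors) > 0

    return [
        x_hat
        for x_hat in sequences
        if not any(dominates(x, x_hat) for x in sequences)
    ]


sequences = list(product((0, 1), repeat=2))
posteriors = [
    {(0, 0): 0.4, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1},
    {(0, 0): 0.3, (0, 1): 0.4, (1, 0): 0.1, (1, 1): 0.2},
]
best = maximal_sequences(posteriors, sequences)
```

The two posteriors disagree on the ranking of `(0, 0)` and `(0, 1)`, so neither dominates the other and both are returned, whereas a precise model would return a single sequence.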

Our statement in Section 4.1, that any coherent updating method would yield the same 
results as regular extension, can now be justified. Since coherent updating is unique if 
$\underline{P}(\{o_{1:n}\}) > 0$, we only need to motivate our statement in the special case that $\underline{P}(\{o_{1:n}\}) = 0$ 
and $\overline{P}(\{o_{1:n}\}) > 0$. 

If we use regular extension to update our model, the optimal estimates are given by 
Eq. (13). For the special case $\underline{P}(\{o_{1:n}\}) = 0$ however, we find for all $x_{1:n} \in \mathcal{X}_{1:n}$ and 
$\hat{x}_{1:n} \in \mathcal{X}_{1:n}$ that 

$\underline{P}(I_{\{o_{1:n}\}}[I_{\{x_{1:n}\}} - I_{\{\hat{x}_{1:n}\}}]) \leq \underline{P}(I_{\{o_{1:n}\}}) = \underline{P}(\{o_{1:n}\}) = 0$, 

where the first inequality follows from the monotonicity of coherent lower previsions (as 
a consequence of C1 and C3). Therefore, we find that if $\underline{P}(\{o_{1:n}\}) = 0$, all sequences are 
optimal, resulting in $\mathrm{opt}(\mathcal{X}_{1:n}|o_{1:n}) = \mathcal{X}_{1:n}$. 

If we use natural extension to update our joint model, the optimal state sequences are still 
given by Eq. (13), but the final equivalence would no longer hold because it uses Theorem 2, 
which assumes the use of regular extension to perform updating of the joint model. However, 
for the special case of $\underline{P}(\{o_{1:n}\}) = 0$, natural extension by definition leads to the updated 
model being equal to the vacuous one. Therefore, we find for all $x_{1:n} \in \mathcal{X}_{1:n}$ and $\hat{x}_{1:n} \in \mathcal{X}_{1:n}$ 
that 

$\underline{P}(I_{\{x_{1:n}\}} - I_{\{\hat{x}_{1:n}\}}|o_{1:n}) = \min(I_{\{x_{1:n}\}} - I_{\{\hat{x}_{1:n}\}}) \leq 0$. 

This implies that for the special case of $\underline{P}(\{o_{1:n}\}) = 0$ and $\overline{P}(\{o_{1:n}\}) > 0$, identical to 
what we found for regular extension, natural extension also results in all sequences being 
optimal, meaning that $\mathrm{opt}(\mathcal{X}_{1:n}|o_{1:n}) = \mathcal{X}_{1:n}$. 

We have thus shown that, in the special case when $\underline{P}(\{o_{1:n}\}) = 0$ and $\overline{P}(\{o_{1:n}\}) > 0$, 
the set of optimal sequences is the same, regardless of whether we use natural or regular 
extension to update our joint model. Since every other coherent updating method lies in 
between those two methods, $\mathrm{opt}(\mathcal{X}_{1:n}|o_{1:n})$ does not depend on the updating method, as 
long as it is coherent. If $\underline{P}(\{o_{1:n}\}) > 0$, coherent updating is unique and thus equal to regular 
extension, thereby making this result trivial in that case. We can therefore conclude that the 
results in this paper do not depend on the particular updating method that is chosen, as long 
as it is coherent. 

Instead of looking for the maximal state sequences, one could also use other decision criteria. A first approach, which we will not consider here, could consist in trying to find the so-called $\Gamma$-maximin state sequences $\hat{x}_{1:n}$, which maximise the posterior lower probability:

$$\hat{x}_{1:n} \in \operatorname{argmax}_{x_{1:n} \in \mathcal{X}_{1:n}} \underline{P}(\{x_{1:n}\} \mid o_{1:n}).$$

While it is well known that any such $\Gamma$-maximin sequence is in particular guaranteed to also be a maximal sequence, finding such $\Gamma$-maximin sequences seems to be a much more complicated affair.$^6$ Of course, once we know all maximal solutions, we could determine which of them are the $\Gamma$-maximin solutions by comparing their posterior lower probabilities. As far as we can see, however, calculating these seems no trivial task from a computational point of view.

We expect similar computational difficulties with yet another approach, also not considered here, which consists in finding the so-called E-admissible sequences. They are those sequences that maximise the expected gain for at least one conditional mass function $p(\cdot \mid o_{1:n})$ in the credal set associated with the updated lower prevision $\underline{P}(\cdot \mid o_{1:n})$. Similarly to the $\Gamma$-maximin solutions, the E-admissible ones are also known to be contained within the set of maximal ones that we will be constructing.

The main reason why our approach is so efficient compared to the other ones, is that 
we do not have to explicitly calculate the value of lower previsions, but only need to know 
their sign, thereby allowing us to work directly with the joint model, instead of the updated 
model. 

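4.3. Maximal subsequences. We shall see below that in order to find the set of maximal estimates, it is useful to consider more general sets of so-called maximal subsequences: for any $k \in \{1,\dots,n\}$ and $z_{k-1} \in \mathcal{X}_{k-1}$, we define $\mathrm{opt}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$ by

$$\hat{x}_{k:n} \in \mathrm{opt}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n}) \Leftrightarrow (\forall x_{k:n} \in \mathcal{X}_{k:n})\ \underline{P}_k(I_{\{o_{k:n}\}}[I_{\{x_{k:n}\}} - I_{\{\hat{x}_{k:n}\}}] \mid z_{k-1}) \leq 0. \tag{14}$$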

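The interpretation of these sets is immediate: consider the following part of the original iHMM, where we take $\underline{Q}_k(\cdot \mid z_{k-1})$ as the marginal model for the first state $X_k$:

[Figure: the part of the iHMM running from $X_k$ to $X_n$. The state subsequence $X_k, \dots, X_n$ carries the local models $\underline{Q}_k(\cdot \mid z_{k-1}), \dots, \underline{Q}_n(\cdot \mid X_{n-1})$; the output subsequence $O_k, \dots, O_n$ carries the local models $\underline{S}_k(\cdot \mid X_k), \dots, \underline{S}_n(\cdot \mid X_n)$.]

$^6$ Private communication from Cassio de Campos.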

Then, as we have argued in Section 3.3, the corresponding joint lower prevision on $\mathcal{L}(\mathcal{X}_{k:n} \times \mathcal{O}_{k:n})$ is precisely $\underline{P}_k(\cdot \mid z_{k-1})$, and if we have a sequence of outputs $o_{k:n}$, then $\mathrm{opt}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$ is the set of state sequence estimates that are undominated by any other estimate in $\mathcal{X}_{k:n}$. It should be clear that the set $\mathrm{opt}(\mathcal{X}_{1:n} \mid o_{1:n})$ we are eventually looking for can also be written as $\mathrm{opt}(\mathcal{X}_{1:n} \mid z_0, o_{1:n})$.

4.4. Useful recursion equations. Fix any $k$ in $\{1,\dots,n\}$. If we look at Equation (14), we see that it will be useful to derive a manageable expression for the lower prevision $\underline{P}_k(I_{\{o_{k:n}\}}[I_{\{x_{k:n}\}} - I_{\{\hat{x}_{k:n}\}}] \mid z_{k-1})$. This can be easily done (see Appendix A) using Equations (3)-(7) together with a few algebraic manipulations. We consider three different cases. If $\hat{x}_k = x_k$ and $k \in \{1,\dots,n-1\}$ then, using the notation introduced in Section 3.3:

= l t ({^}k-i)&({o*}|iO©at+i^i»}^i»}-W»}l**)- (!5) 

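If $\hat{x}_n = x_n$ then

$$\underline{P}_n(I_{\{o_n\}}[I_{\{x_n\}} - I_{\{\hat{x}_n\}}] \mid z_{n-1}) = 0. \tag{16}$$

If $\hat{x}_k \neq x_k$ and $k \in \{1,\dots,n\}$ then

$$\underline{P}_k(I_{\{o_{k:n}\}}[I_{\{x_{k:n}\}} - I_{\{\hat{x}_{k:n}\}}] \mid z_{k-1}) = \underline{Q}_k(I_{\{x_k\}}\,\beta(x_{k:n}) - I_{\{\hat{x}_k\}}\,\alpha(\hat{x}_{k:n}) \mid z_{k-1}), \tag{17}$$

where we define, for any $z_{k:n} \in \mathcal{X}_{k:n}$:

$$\beta(z_{k:n}) := \underline{S}_k(\{o_k\} \mid z_k) \prod_{i=k+1}^{n} \underline{S}_i(\{o_i\} \mid z_i)\,\underline{Q}_i(\{z_i\} \mid z_{i-1}) \tag{18}$$

$$\alpha(z_{k:n}) := \overline{S}_k(\{o_k\} \mid z_k) \prod_{i=k+1}^{n} \overline{S}_i(\{o_i\} \mid z_i)\,\overline{Q}_i(\{z_i\} \mid z_{i-1}). \tag{19}$$

For any given sequence of states $z_{k:n} \in \mathcal{X}_{k:n}$, the $\alpha(z_{k:n})$ and $\beta(z_{k:n})$ can be found by simple backward recursion:

$$\alpha(z_{k:n}) = \alpha(z_{k+1:n})\,\overline{S}_k(\{o_k\} \mid z_k)\,\overline{Q}_{k+1}(\{z_{k+1}\} \mid z_k) \tag{20}$$

$$\beta(z_{k:n}) = \beta(z_{k+1:n})\,\underline{S}_k(\{o_k\} \mid z_k)\,\underline{Q}_{k+1}(\{z_{k+1}\} \mid z_k), \tag{21}$$

for $k \in \{1,\dots,n-1\}$, and starting from:

$$\alpha(z_{n:n}) = \alpha(z_n) := \overline{S}_n(\{o_n\} \mid z_n) \quad\text{and}\quad \beta(z_{n:n}) = \beta(z_n) := \underline{S}_n(\{o_n\} \mid z_n).$$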
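
As an illustration of the backward recursion in Equations (20)-(21), the following minimal Python sketch computes $\alpha$ and $\beta$ for one given state sequence. The function name `alpha_beta` and the nested-list layout of the local models are assumptions made for this example only; they are not part of the paper.

```python
def alpha_beta(z, outputs, S_low, S_up, Q_low, Q_up):
    """Backward recursion (20)-(21) for one state sequence z = (z_k, ..., z_n).

    Hypothetical data layout (0-based offsets within the subsequence):
      S_low[i][x][o], S_up[i][x][o] : lower/upper emission probabilities
      Q_low[i][x_prev][x], Q_up[i][x_prev][x] : lower/upper transition
          probabilities into offset i (offset 0 is not used here)
    """
    last = len(z) - 1
    # Starting values: alpha(z_n) and beta(z_n).
    alpha = S_up[last][z[last]][outputs[last]]
    beta = S_low[last][z[last]][outputs[last]]
    # Walk backwards, multiplying in one emission and one transition per step.
    for i in range(last - 1, -1, -1):
        alpha *= S_up[i][z[i]][outputs[i]] * Q_up[i + 1][z[i]][z[i + 1]]
        beta *= S_low[i][z[i]][outputs[i]] * Q_low[i + 1][z[i]][z[i + 1]]
    return alpha, beta
```

Each step costs a constant number of multiplications, so evaluating one sequence is linear in its length.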
5. The Principle of Optimality 

Determining the state sequences in $\mathrm{opt}(\mathcal{X}_{1:n} \mid o_{1:n})$ directly using Equation (13) clearly has exponential complexity (in the length of the chain). We are now going to take a dynamic programming approach [1] to reducing this complexity by deriving a recursion equation for the sets of optimal (sub)sequences $\mathrm{opt}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$.






JASPER DE BOCK AND GERT DE COOMAN 



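Theorem 3 (Principle of Optimality). For $k \in \{1,\dots,n-1\}$, all $z_{k-1} \in \mathcal{X}_{k-1}$ and all $\hat{x}_{k:n} \in \mathcal{X}_{k:n}$: if $\underline{Q}_k(\{\hat{x}_k\} \mid z_{k-1}) > 0$ and $\underline{S}_k(\{o_k\} \mid \hat{x}_k) > 0$, then

$$\hat{x}_{k:n} \in \mathrm{opt}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n}) \Rightarrow \hat{x}_{k+1:n} \in \mathrm{opt}(\mathcal{X}_{k+1:n} \mid \hat{x}_k, o_{k+1:n}).$$

As an immediate consequence, we find that

$$\mathrm{opt}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n}) \subseteq \mathrm{cand}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n}), \tag{22}$$

with $\mathrm{cand}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$ being the set of sequences in $\mathcal{X}_{k:n}$ that can still be an element of $\mathrm{opt}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$ according to the theorem above:

$$\mathrm{cand}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n}) := \Big( \bigcup_{z_k \in \mathrm{Pos}_k(z_{k-1})} z_k \oplus \mathrm{opt}(\mathcal{X}_{k+1:n} \mid z_k, o_{k+1:n}) \Big) \cup \Big( \bigcup_{z_k \notin \mathrm{Pos}_k(z_{k-1})} z_k \oplus \mathcal{X}_{k+1:n} \Big). \tag{23}$$

Here $\oplus$ denotes concatenation of state sequences and the set of states $\mathrm{Pos}_k(z_{k-1}) \subseteq \mathcal{X}_k$ is defined as

$$z_k \in \mathrm{Pos}_k(z_{k-1}) \Leftrightarrow \underline{Q}_k(\{z_k\} \mid z_{k-1}) > 0 \text{ and } \underline{S}_k(\{o_k\} \mid z_k) > 0. \tag{24}$$

Equation (23) simplifies to

$$\mathrm{cand}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n}) = \bigcup_{z_k \in \mathcal{X}_k} z_k \oplus \mathrm{opt}(\mathcal{X}_{k+1:n} \mid z_k, o_{k+1:n}) \tag{25}$$

if all local lower previsions are positive, but this is not generally true in the more general case we are considering here, where only the upper previsions are required to be positive. We also introduce the following notation:

$$\mathrm{cand}_{\hat{x}_{k:s}}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n}) := \{ z_{k:n} \in \mathrm{cand}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n}) : z_{k:s} = \hat{x}_{k:s} \} \tag{26}$$

for all $k \in \{1,\dots,n\}$, $s \in \{k,\dots,n\}$, $z_{k-1} \in \mathcal{X}_{k-1}$, $\hat{x}_{k:s} \in \mathcal{X}_{k:s}$ and $o_{k:n} \in \mathcal{O}_{k:n}$.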
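
To make Equation (23) concrete, here is a small Python sketch that builds the candidate set explicitly as a set of tuples. It assumes, for illustration only, that every position has the same state space, and the names `cand`, `pos`, `opt_next` and `tail_length` are hypothetical. This explicit enumeration is exponential for states outside $\mathrm{Pos}_k(z_{k-1})$ and only serves to show the definition.

```python
from itertools import product


def cand(states, pos, opt_next, tail_length):
    """Candidate set of Eq. (23), returned as a set of tuples.

    states      : the state space (assumed identical at every position)
    pos         : the set Pos_k(z_{k-1}) of Eq. (24)
    opt_next    : dict mapping z_k to the set opt(X_{k+1:n} | z_k, o_{k+1:n})
    tail_length : n - k, the length of the sequences in X_{k+1:n}
    """
    result = set()
    for z in states:
        if z in pos:
            # Principle of Optimality applies: only optimal tails can follow z.
            tails = opt_next[z]
        else:
            # No pruning possible: every tail in X_{k+1:n} remains a candidate.
            tails = set(product(states, repeat=tail_length))
        result |= {(z,) + tuple(t) for t in tails}
    return result
```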

6. An algorithm for finding maximal state sequences 

We now use Equation (22) to devise an algorithm for constructing the set $\mathrm{opt}(\mathcal{X}_{1:n} \mid o_{1:n})$ of maximal state sequences in a recursive manner.

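6.1. Initial set-up using backward recursion. We begin by defining a few auxiliary notions. First of all, we consider the thresholds:

$$\theta_k(\hat{x}_k, x_k \mid z_{k-1}) := \min\{ a \geq 0 : \underline{Q}_k(I_{\{x_k\}} - a I_{\{\hat{x}_k\}} \mid z_{k-1}) \leq 0 \} \tag{27}$$

for all $k \in \{1,\dots,n\}$, $z_{k-1} \in \mathcal{X}_{k-1}$ and $x_k, \hat{x}_k \in \mathcal{X}_k$.

Next, we define

$$\alpha_k^{\max}(x_k) := \max_{\substack{z_{k:n} \in \mathcal{X}_{k:n} \\ z_k = x_k}} \alpha(z_{k:n}) \quad\text{and}\quad \beta_k^{\max}(x_k) := \max_{\substack{z_{k:n} \in \mathcal{X}_{k:n} \\ z_k = x_k}} \beta(z_{k:n}) \tag{28}$$

for all $k \in \{1,\dots,n\}$ and $x_k \in \mathcal{X}_k$. Using Equations (20)-(21), these can be calculated efficiently using the following backward recursive (dynamic programming) procedure:

$$\alpha_k^{\max}(x_k) = \max_{z_{k+1} \in \mathcal{X}_{k+1}} \alpha_{k+1}^{\max}(z_{k+1})\,\overline{S}_k(\{o_k\} \mid x_k)\,\overline{Q}_{k+1}(\{z_{k+1}\} \mid x_k) = \overline{S}_k(\{o_k\} \mid x_k) \max_{z_{k+1} \in \mathcal{X}_{k+1}} \alpha_{k+1}^{\max}(z_{k+1})\,\overline{Q}_{k+1}(\{z_{k+1}\} \mid x_k), \tag{29}$$

and

$$\beta_k^{\max}(x_k) = \max_{z_{k+1} \in \mathcal{X}_{k+1}} \beta_{k+1}^{\max}(z_{k+1})\,\underline{S}_k(\{o_k\} \mid x_k)\,\underline{Q}_{k+1}(\{z_{k+1}\} \mid x_k) = \underline{S}_k(\{o_k\} \mid x_k) \max_{z_{k+1} \in \mathcal{X}_{k+1}} \beta_{k+1}^{\max}(z_{k+1})\,\underline{Q}_{k+1}(\{z_{k+1}\} \mid x_k), \tag{30}$$

for $k \in \{1,\dots,n-1\}$, starting from

$$\alpha_n^{\max}(x_n) = \alpha(x_n) = \overline{S}_n(\{o_n\} \mid x_n) \quad\text{and}\quad \beta_n^{\max}(x_n) = \beta(x_n) = \underline{S}_n(\{o_n\} \mid x_n). \tag{31}$$

Finally, we let

$$\alpha_k^{\mathrm{opt}}(\hat{x}_k \mid z_{k-1}) := \max_{\substack{x_k \in \mathcal{X}_k \\ x_k \neq \hat{x}_k}} \beta_k^{\max}(x_k)\,\theta_k(\hat{x}_k, x_k \mid z_{k-1}), \tag{32}$$

for all $k \in \{1,\dots,n\}$, $z_{k-1} \in \mathcal{X}_{k-1}$ and $\hat{x}_k \in \mathcal{X}_k$.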
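
The backward procedure of Equations (29)-(31) and the maximisation in Equation (32) can be sketched in Python as follows. This is only an illustrative sketch under simplifying assumptions: the local models are taken to be stationary, the thresholds of Equation (27) are supplied as a precomputed table, and the names `backward_max`, `alpha_opt` and `theta` are hypothetical.

```python
def backward_max(outputs, S, Q, states):
    """Eqs. (29)-(31): m[i][x] is the maximum, over all sequences that start
    in state x at offset i, of the product of local probabilities.

    Pass upper probabilities to obtain alpha^max, lower ones for beta^max.
    S[x][o] is an emission probability, Q[x][y] a transition probability
    from x to y (a stationary model is assumed here).
    """
    n = len(outputs)
    m = [dict() for _ in range(n)]
    for x in states:
        m[n - 1][x] = S[x][outputs[n - 1]]  # starting values, Eq. (31)
    for i in range(n - 2, -1, -1):
        for x in states:
            m[i][x] = S[x][outputs[i]] * max(m[i + 1][y] * Q[x][y] for y in states)
    return m


def alpha_opt(beta_max_i, theta, x_hat, z_prev, states):
    """Eq. (32). theta[(x_hat, x, z_prev)] stands for theta_k(x_hat, x | z_prev)."""
    return max(beta_max_i[x] * theta[(x_hat, x, z_prev)]
               for x in states if x != x_hat)
```

Filling the table costs one maximisation over the next state for every position and every current state, which is where the quadratic dependence on the number of states comes from.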

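6.2. Reformulation of the optimality condition. It turns out that the $\alpha_k^{\mathrm{opt}}(\hat{x}_k \mid z_{k-1})$, calculated by Equation (32), are extremely useful. As proved in Appendix A, they allow us to significantly simplify Equation (14) as follows:

$$\mathrm{opt}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n}) = \{ \hat{x}_{k:n} \in \mathrm{cand}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n}) : \alpha(\hat{x}_{k:n}) \geq \alpha_k^{\mathrm{opt}}(\hat{x}_k \mid z_{k-1}) \}, \tag{33}$$

which, for $k = n$, reduces to

$$\mathrm{opt}(\mathcal{X}_n \mid z_{n-1}, o_n) = \{ \hat{x}_n \in \mathcal{X}_n : \alpha(\hat{x}_n) \geq \alpha_n^{\mathrm{opt}}(\hat{x}_n \mid z_{n-1}) \}. \tag{34}$$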

6.3. A recursive solution method. The aim of the algorithm is to determine the set $\mathrm{opt}(\mathcal{X}_{1:n} \mid o_{1:n})$ efficiently. We will do so recursively.

For $k = n$, $\mathrm{opt}(\mathcal{X}_n \mid z_{n-1}, o_n)$ can be determined in a straightforward manner for every $z_{n-1} \in \mathcal{X}_{n-1}$ using Criterion (34).

Example 2. We consider a simple binary HMM with $\mathcal{X} = \{0, 1\}$. For $k = n$, the maximal elements are simply states, which are trivially represented. We could for example find that $\mathrm{opt}(\mathcal{X}_n \mid 0, o_n) = \{0, 1\}$ for $z_{n-1} = 0$, and $\mathrm{opt}(\mathcal{X}_n \mid 1, o_n) = \{0\}$ for $z_{n-1} = 1$. ♦

Next, we let $k$ run backward from $n-1$ to $1$. For each $k < n$ and all $z_{k-1} \in \mathcal{X}_{k-1}$, we first build up the set $\mathrm{cand}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$, using its definition in Equation (23) and the results of the previous recursion step. This set is then used to determine $\mathrm{opt}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$ with Criterion (33).

Example 3. We continue the discussion of Example 2. For $k = n-1$ and $z_{n-2} = 0$, the set $\mathrm{cand}(\mathcal{X}_{n-1:n} \mid 0, o_{n-1:n})$ is constructed using Equation (23). If, for instance, $\mathrm{Pos}_{n-1}(0) = \{0, 1\}$, this reduces to Equation (25) and we find that

$$\mathrm{cand}(\mathcal{X}_{n-1:n} \mid 0, o_{n-1:n}) = \bigcup_{z_{n-1} \in \{0,1\}} z_{n-1} \oplus \mathrm{opt}(\mathcal{X}_n \mid z_{n-1}, o_n) = 0 \oplus \{0, 1\} \cup 1 \oplus \{0\} = \{00, 01\} \cup \{10\} = \{00, 01, 10\}.$$

Applying Criterion (33) to every element of this set, we find the set $\mathrm{opt}(\mathcal{X}_{n-1:n} \mid 0, o_{n-1:n})$, which for instance could be equal to $\{00, 10\}$. For $z_{n-2} = 1$, an analogous method can be used. ♦

Continuing in this way, we eventually reach $k = 1$, which yields the desired set of maximal sequences $\mathrm{opt}(\mathcal{X}_{1:n} \mid o_{1:n}) = \mathrm{opt}(\mathcal{X}_{1:n} \mid z_0, o_{1:n})$.

The possible bottleneck in this solution lies in the use of Criterion (33). While this criterion is already much more efficient than the original one, it can still lead to an exponential complexity if the set $\mathrm{cand}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$ has a number of elements that is exponential in the length of the considered sequences. We therefore present a method that avoids checking the inequality in Criterion (33) for all elements of $\mathrm{cand}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$.

The first trick consists in using an efficient data structure to store the sets of optimal 
sequences. For k = n, this is simply a list of the elements. For k < n, we could also just list 
the optimal sequences, but this would imply storing the same information multiple times, 
since parts of those sequences will be the same. We therefore choose to represent this list of 
optimal sequences as a collection of tree structures. The way these trees are constructed 
should be obvious from the following example.

Example 4. Consider the following set of sequences:

{00001000, 00001010, 00001110, 00011110, 10001010, 10001110}

By representing this set in this way, useful information gets lost and memory space is wasted. For example, several of these sequences start out the same way. It would be much more efficient to store such common subsequences only once.




We therefore prefer to represent the above set as the collection of trees depicted above. ♦ 
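
One simple way to realise such a collection of trees in code is a prefix tree (trie) built from nested dictionaries. The following Python sketch is an assumption about a possible implementation, not the data structure actually used by the authors; the names `build_trie`, `count_nodes` and `enumerate_sequences` are hypothetical.

```python
def build_trie(sequences):
    """Store equal-length sequences in nested dicts; shared prefixes are stored once."""
    root = {}
    for seq in sequences:
        node = root
        for symbol in seq:
            node = node.setdefault(symbol, {})
    return root


def count_nodes(trie):
    """Number of nodes in the collection of trees."""
    return sum(1 + count_nodes(child) for child in trie.values())


def enumerate_sequences(trie, prefix=""):
    """Recover the stored sequences by walking every root-to-leaf path."""
    if not trie:
        yield prefix
        return
    for symbol, child in trie.items():
        yield from enumerate_sequences(child, prefix + symbol)


example_4 = ["00001000", "00001010", "00001110",
             "00011110", "10001010", "10001110"]
trie = build_trie(example_4)
```

For the six sequences of Example 4, the two top-level entries of `trie` play the role of the two tree roots (0 and 1), and the structure holds 29 nodes instead of the 48 symbols of the plain list.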

The next step is now to exploit this data structure in order to apply Criterion (33) efficiently. We start by constructing the set $\mathrm{cand}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$ and representing it in the same type of data structure.

Example 5. We consider the set of sequences in Example 4 to be $\mathrm{opt}(\mathcal{X}_{k+1:n} \mid 0, o_{k+1:n})$, where $k = n - 8$, since the length of the sequences is 8. Suppose we have already constructed this set in the previous recursion step. Furthermore, for the sake of this example, let us assume that $0 \in \mathrm{Pos}_k(0)$ and $1 \notin \mathrm{Pos}_k(0)$. We will now use Equation (23) to construct the set $\mathrm{cand}(\mathcal{X}_{k:n} \mid 0, o_{k:n})$:

$$\mathrm{cand}(\mathcal{X}_{k:n} \mid 0, o_{k:n}) = 0 \oplus \mathrm{opt}(\mathcal{X}_{k+1:n} \mid 0, o_{k+1:n}) \cup 1 \oplus \mathcal{X}_{k+1:n}.$$

The set $\mathrm{cand}(\mathcal{X}_{k:n} \mid 0, o_{k:n})$ consists of two subsets, which we will construct separately. The subset $0 \oplus \mathrm{opt}(\mathcal{X}_{k+1:n} \mid 0, o_{k+1:n})$ would normally take quite some effort to compose, since we have to concatenate 0 with each individual element of $\mathrm{opt}(\mathcal{X}_{k+1:n} \mid 0, o_{k+1:n})$. However, using our representation, this comes down to adding one node and two links to the already existing data structure for $\mathrm{opt}(\mathcal{X}_{k+1:n} \mid 0, o_{k+1:n})$:

[Figure: the trees of $\mathrm{opt}(\mathcal{X}_{k+1:n} \mid 0, o_{k+1:n})$ joined under a new root node 0.]

Conceptually, we want to represent the set $1 \oplus \mathcal{X}_{k+1:n}$ as a tree, which would look like the figure below on the left.

[Figure: left, the set $1 \oplus \mathcal{X}_{k+1:n}$ drawn as a complete tree below a root node 1; right, the same set drawn as a root node 1 linked to a single placeholder for $\mathcal{X}_{k+1:n}$.]

However, storing it this way in a computer is a bad idea, as this would mean constructing a complete binary tree, which is exponential in the depth of this tree. We therefore remember that the set of sequences can be represented as a tree, without actually constructing it, as is depicted above on the right.

[Figure: the tree representation of $\mathrm{cand}(\mathcal{X}_{k:n} \mid 0, o_{k:n})$, combining the root 0 above $\mathrm{opt}(\mathcal{X}_{k+1:n} \mid 0, o_{k+1:n})$ with the root 1 above the placeholder for $\mathcal{X}_{k+1:n}$.]

The set $\mathrm{cand}(\mathcal{X}_{k:n} \mid 0, o_{k:n})$ we are looking for is then trivially constructed by joining the two subsets $0 \oplus \mathrm{opt}(\mathcal{X}_{k+1:n} \mid 0, o_{k+1:n})$ and $1 \oplus \mathcal{X}_{k+1:n}$, as depicted above. ♦

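It follows from Equation (33) that the data structure representing $\mathrm{opt}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$ is contained in the data structure representing $\mathrm{cand}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$. All that is now left to do is find this subset in an efficient manner. We present a method that constructs a subset of $\mathrm{cand}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$, and will prove that this subset is indeed $\mathrm{opt}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$.

We first define $\alpha_k^{\mathrm{opt}}(z_{k:s} \mid z_{k-1})$ for every $k \in \{1,\dots,n\}$, $s \in \{k,\dots,n\}$, $z_{k-1} \in \mathcal{X}_{k-1}$ and $z_{k:s} \in \mathcal{X}_{k:s}$. If $s = k$, we let $\alpha_k^{\mathrm{opt}}(z_{k:k} \mid z_{k-1}) := \alpha_k^{\mathrm{opt}}(z_k \mid z_{k-1})$, defined by Equation (32). $\alpha_k^{\mathrm{opt}}(z_{k:s} \mid z_{k-1})$ is then recursively defined by

$$\alpha_k^{\mathrm{opt}}(z_{k:s} \mid z_{k-1}) := \frac{\alpha_k^{\mathrm{opt}}(z_{k:s-1} \mid z_{k-1})}{\overline{S}_{s-1}(\{o_{s-1}\} \mid z_{s-1})\,\overline{Q}_s(\{z_s\} \mid z_{s-1})} \quad \text{for every } s \in \{k+1,\dots,n\}. \tag{35}$$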



Optimal tree construction. The following method will select a subset out of a given set $\mathrm{cand}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$ constructed using Equation (23).

First, for every $x_k \in \mathcal{X}_k$, check whether

$$\alpha_k^{\max}(x_k) \geq \alpha_k^{\mathrm{opt}}(x_k \mid z_{k-1}). \tag{36}$$

From now on, we will use the generic notation $\hat{x}_k$ for those $x_k \in \mathcal{X}_k$ for which this condition is satisfied.

Next, choose an arbitrary $\hat{x}_k$ and check, for every $x_{k+1} \in \mathcal{X}_{k+1}$ that has a non-empty set $\mathrm{cand}_{\hat{x}_k \oplus x_{k+1}}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$, if the following condition is satisfied:

$$\alpha_{k+1}^{\max}(x_{k+1}) \geq \alpha_k^{\mathrm{opt}}(\hat{x}_k \oplus x_{k+1} \mid z_{k-1}). \tag{37}$$

Notice that $\alpha_k^{\mathrm{opt}}(\hat{x}_k \oplus x_{k+1} \mid z_{k-1})$ can be easily calculated using Equation (35), because $\alpha_k^{\mathrm{opt}}(\hat{x}_k \mid z_{k-1})$ is already known from the previous step. Denote those $x_{k+1} \in \mathcal{X}_{k+1}$ for which the inequality (37) is true generically by $\hat{x}_{k+1}$ and concatenate them with the state $\hat{x}_k$, creating a set of state sequences $\hat{x}_{k:k+1}$. Do this for every $\hat{x}_k$ of the previous step and bundle the sets, obtaining a larger set of state sequences $\hat{x}_{k:k+1}$.

In a next step, consider an arbitrary $\hat{x}_{k:k+1}$ and check, for every $x_{k+2} \in \mathcal{X}_{k+2}$ that has a non-empty set $\mathrm{cand}_{\hat{x}_{k:k+1} \oplus x_{k+2}}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$, if the following condition is satisfied:

$$\alpha_{k+2}^{\max}(x_{k+2}) \geq \alpha_k^{\mathrm{opt}}(\hat{x}_{k:k+1} \oplus x_{k+2} \mid z_{k-1}). \tag{38}$$

As before, $\alpha_k^{\mathrm{opt}}(\hat{x}_{k:k+1} \oplus x_{k+2} \mid z_{k-1})$ can be calculated easily using Equation (35), since $\alpha_k^{\mathrm{opt}}(\hat{x}_{k:k+1} \mid z_{k-1})$ has already been calculated in the previous step. Denote those $x_{k+2} \in \mathcal{X}_{k+2}$ for which the inequality (38) holds generically by $\hat{x}_{k+2}$ and concatenate them with $\hat{x}_{k:k+1}$, creating a set of state sequences $\hat{x}_{k:k+2}$. Do this for every $\hat{x}_{k:k+1}$ from the previous step and bundle the sets to obtain a larger set of state sequences $\hat{x}_{k:k+2}$.

It should be clear that we can go on this way, to eventually end up with a set of sequences $\hat{x}_{k:n-1}$. Now consider an arbitrary $\hat{x}_{k:n-1}$ and check, for every $x_n \in \mathcal{X}_n$ that has a non-empty set $\mathrm{cand}_{\hat{x}_{k:n-1} \oplus x_n}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$, if the following condition holds:

$$\alpha_n^{\max}(x_n) \geq \alpha_k^{\mathrm{opt}}(\hat{x}_{k:n-1} \oplus x_n \mid z_{k-1}). \tag{39}$$

Denote those $x_n \in \mathcal{X}_n$ for which this is the case as $\hat{x}_n$, and concatenate them with $\hat{x}_{k:n-1}$, creating a set of state sequences $\hat{x}_{k:n}$. Do this for every $\hat{x}_{k:n-1}$ from the previous step and bundle the sets to finally obtain a set of state sequences $\hat{x}_{k:n}$, which is a subset of the set $\mathrm{cand}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$ we started out from.

Theorem 4. The subset of $\mathrm{cand}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$ that is obtained by using the optimal tree construction is equal to $\mathrm{opt}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$.
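
The optimal tree construction can be read as a depth-first search that extends a prefix only while Conditions (36)-(39) hold. The following Python sketch is one possible rendering under simplifying assumptions (stationary upper probabilities, a caller-supplied prefix test for the candidate set); the function `optimal_tree` and all of its parameter names are hypothetical and not taken from the paper.

```python
def optimal_tree(cand_has_prefix, states, length, alpha_max,
                 alpha_opt_first, S_up, Q_up, outputs):
    """Depth-first sketch of the optimal tree construction.

    cand_has_prefix(prefix) -> bool : True if cand(...) contains a sequence
                                      that starts with the tuple `prefix`
    length          : n - k + 1, the length of the sequences to build
    alpha_max[i][x] : alpha^max at offset i for state x, Eqs. (29) and (31)
    alpha_opt_first[x] : alpha_k^opt(x | z_{k-1}) of Eq. (32)
    S_up[x][o], Q_up[x][y] : stationary upper emission/transition
                             probabilities, assumed strictly positive
    """
    results = []

    def extend(prefix, threshold):
        i = len(prefix)
        if i == length:
            results.append(prefix)
            return
        for x in states:
            candidate = prefix + (x,)
            if not cand_has_prefix(candidate):
                continue  # no candidate sequence starts with this prefix
            if i == 0:
                t = alpha_opt_first[x]
            else:
                last = prefix[-1]
                # Eq. (35): update the threshold for the longer prefix.
                t = threshold / (S_up[last][outputs[i - 1]] * Q_up[last][x])
            if alpha_max[i][x] >= t:  # Conditions (36)-(39)
                extend(candidate, t)

    extend((), None)
    return results
```

With floating-point inputs, comparisons that are exact ties in theory may come out either way; using exact rationals (for instance `fractions.Fraction`) avoids this in small experiments.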

Example 6. We continue with Example 5. Following the optimal tree construction, we start by checking for every $x_k \in \{0, 1\}$ whether $\alpha_k^{\max}(x_k) \geq \alpha_k^{\mathrm{opt}}(x_k \mid 0)$. Suppose this is the case. We will symbolise this by giving the corresponding nodes in our representation a green colour, as in the leftmost part of the figure below. It then follows by Theorem 4 that every sequence in $\mathrm{opt}(\mathcal{X}_{k:n} \mid 0, o_{k:n})$ will either start with 0 or 1, since the set of $\hat{x}_k$ is $\{0, 1\}$. In this example, this is of course trivial, but if the set of $\hat{x}_k$ had been $\{0\}$, we would have obtained the non-trivial result that every sequence in $\mathrm{opt}(\mathcal{X}_{k:n} \mid 0, o_{k:n})$ starts with 0. We can represent this partial information about the set $\mathrm{opt}(\mathcal{X}_{k:n} \mid 0, o_{k:n})$ in a trivial way, as in the rightmost part of the figure below.

[Figure: left, the tree of $\mathrm{cand}(\mathcal{X}_{k:n} \mid 0, o_{k:n})$ with both root nodes coloured green, above $\mathrm{opt}(\mathcal{X}_{k+1:n} \mid 0, o_{k+1:n})$ and the placeholder for $\mathcal{X}_{k+1:n}$; right, the partial tree for $\mathrm{opt}(\mathcal{X}_{k:n} \mid 0, o_{k:n})$ constructed so far.]

In the next step, we need to check some criteria for every $\hat{x}_k$ we have found in the previous step. We begin with $\hat{x}_k = 0$ and start by looking at $x_{k+1} = 0$. The set $\mathrm{cand}_{\hat{x}_k \oplus x_{k+1}}(\mathcal{X}_{k:n} \mid 0, o_{k:n})$ is then $\mathrm{cand}_{00}(\mathcal{X}_{k:n} \mid 0, o_{k:n})$, which is simply the subset of sequences in $\mathrm{cand}(\mathcal{X}_{k:n} \mid 0, o_{k:n})$ that start with 00. In our tree representation of $\mathrm{cand}(\mathcal{X}_{k:n} \mid 0, o_{k:n})$, checking whether this set is non-empty comes down to checking if the node $\hat{x}_k = 0$ has a daughter with value 0. Since this is indeed the case, we need to check whether $\alpha_{k+1}^{\max}(0) \geq \alpha_k^{\mathrm{opt}}(\hat{x}_k \oplus x_{k+1} \mid 0) = \alpha_k^{\mathrm{opt}}(00 \mid 0)$. Suppose this criterion is met, then we have found our first subsequence $\hat{x}_{k:k+1}$, namely 00. We symbolise this in the figure below by giving the child $x_{k+1} = 0$ of the node $\hat{x}_k = 0$ a green colour.

The node $\hat{x}_k = 0$ also has a daughter $x_{k+1} = 1$. If $\alpha_{k+1}^{\max}(1) < \alpha_k^{\mathrm{opt}}(01 \mid 0)$, this daughter gets coloured red and 01 is not part of the set of sequences $\hat{x}_{k:k+1}$ we are constructing in this step. By Theorem 4, this also means that none of the elements of $\mathrm{opt}(\mathcal{X}_{k:n} \mid 0, o_{k:n})$ will start with the subsequence 01.

For $\hat{x}_k = 1$, we know that the tree representing the sequences in $\mathrm{cand}(\mathcal{X}_{k:n} \mid 0, o_{k:n})$ that start with 1 is a complete tree, which we have not explicitly constructed. This does not create a problem, since we only need that tree to check whether $\mathrm{cand}_{1 \oplus x_{k+1}}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$ is a non-empty set, which is a condition that is trivially met for all $x_{k+1} \in \mathcal{X}_{k+1}$ because of the completeness of the set $\mathrm{cand}_1(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$. We are therefore left to check Criterion (37) for $\hat{x}_k = 1$ and every $x_{k+1} \in \{0, 1\}$. For $x_{k+1} = 0$, we might for instance find that $\alpha_{k+1}^{\max}(0) < \alpha_k^{\mathrm{opt}}(10 \mid 0)$ and for $x_{k+1} = 1$ we might find that $\alpha_{k+1}^{\max}(1) \geq \alpha_k^{\mathrm{opt}}(11 \mid 0)$.

The results of these checks are summarised in the leftmost part of the figure below. The corresponding sequences $\hat{x}_{k:k+1}$, which by Theorem 4 are the possible starting sequences for the elements of $\mathrm{opt}(\mathcal{X}_{k:n} \mid 0, o_{k:n})$, can be easily stored and depicted in our tree representation; see the rightmost part of the following figure.




If we keep performing the steps of optimal tree construction in this way, Theorem 4 states that the data structure that is built up while checking all these criteria represents the set $\mathrm{opt}(\mathcal{X}_{k:n} \mid 0, o_{k:n})$. This set might look like this:

[Figure: a possible tree representation of $\mathrm{opt}(\mathcal{X}_{k:n} \mid 0, o_{k:n})$.]

Figure 3 should clarify how this set was constructed. Notice that we have indeed never explicitly constructed the set $\mathcal{X}_{k+1:n}$ in the tree representation, since every time we reached a red node, the descendants of this node were not constructed. ♦

[Figure 3: green and red node colouring on the tree of $\mathrm{cand}(\mathcal{X}_{k:n} \mid 0, o_{k:n})$, above $\mathrm{opt}(\mathcal{X}_{k+1:n} \mid 0, o_{k+1:n})$ and the placeholder for $\mathcal{X}_{k+1:n}$.]

Figure 3. Clarification of the construction of $\mathrm{cand}(\mathcal{X}_{k:n} \mid 0, o_{k:n})$



6.4. Additional comments. All that is needed in order to produce the $\alpha$- and $\beta$-functions are assessments for the lower and upper transition and emission probabilities:

$$\underline{Q}_k(\{z_k\} \mid z_{k-1}),\ \overline{Q}_k(\{z_k\} \mid z_{k-1}),\ \underline{S}_k(\{o_k\} \mid z_k) \text{ and } \overline{S}_k(\{o_k\} \mid z_k)$$

for all $k \in \{1,\dots,n\}$, $z_{k-1} \in \mathcal{X}_{k-1}$, $z_k \in \mathcal{X}_k$ and $o_k \in \mathcal{O}_k$. The most conservative coherent models $\underline{Q}_k(\cdot \mid z_{k-1})$ that correspond to such assessments are 2-monotone [3, 7]. Due to their comonotone additivity [7], this implies that:

$$\underline{Q}_k(I_{\{x_k\}} - a I_{\{\hat{x}_k\}} \mid z_{k-1}) = \underline{Q}_k(\{x_k\} \mid z_{k-1}) - a\,\overline{Q}_k(\{\hat{x}_k\} \mid z_{k-1})$$

for all $a \geq 0$, and therefore Equation (27) leads to

$$\theta_k(\hat{x}_k, x_k \mid z_{k-1}) = \frac{\underline{Q}_k(\{x_k\} \mid z_{k-1})}{\overline{Q}_k(\{\hat{x}_k\} \mid z_{k-1})}. \tag{40}$$

The right-hand side is the smallest possible value of the threshold $\theta_k(\hat{x}_k, x_k \mid z_{k-1})$ corresponding to the assessments $\underline{Q}_k(\{x_k\} \mid z_{k-1})$ and $\overline{Q}_k(\{\hat{x}_k\} \mid z_{k-1})$, leading to the most conservative inferences, and therefore the largest possible sets of maximal sequences, that correspond to these assessments.
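
Under the 2-monotonicity assumption above, Equation (40) turns the threshold into a simple ratio. A minimal Python sketch, with a hypothetical function name and a nested-dictionary layout chosen only for this example, could look as follows.

```python
def threshold(q_low, q_up, x_hat, x, z_prev):
    """Eq. (40): theta_k(x_hat, x | z_prev) as a ratio of local probabilities.

    q_low[z_prev][x] : lower transition probability of {x} given z_prev
    q_up[z_prev][x]  : upper transition probability of {x} given z_prev,
                       assumed strictly positive
    """
    return q_low[z_prev][x] / q_up[z_prev][x_hat]
```

The positivity of the upper probability in the denominator is the same assumption the text makes for the local upper previsions.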



7. Discussion of the algorithm's complexity 

7.1. Preparatory calculations. We begin with the preparatory calculations of the quantities in Equations (27)-(32). For the thresholds $\theta_k(\hat{x}_k, x_k \mid z_{k-1})$ in Equation (27), the computational complexity is clearly cubic in the number of states, and linear in the number of nodes. Calculating the $\alpha_k^{\max}(x_k)$ and $\beta_k^{\max}(x_k)$ in Equations (29) and (30) is linear in the number of nodes, and quadratic in the number of states. The complexity of finding the $\alpha_k^{\mathrm{opt}}(\hat{x}_k \mid z_{k-1})$ in Equation (32) is linear in the number of nodes, and cubic in the number of states.

7.2. Complexity of the optimal tree construction. The computational complexity of the optimal tree construction is less trivial. Let us start by noting that this construction essentially consists in repeating the same small step over and over again, namely adding a state $\hat{x}_s$ to an already constructed $\hat{x}_{k:s-1}$.

To perform such a step for a sequence $\hat{x}_{k:s-1}$, we first have to check for all $x_s \in \mathcal{X}_s$ whether $\mathrm{cand}_{\hat{x}_{k:s-1} \oplus x_s}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$ is non-empty. This can be done in constant time, since our representation reduces this step to checking whether the node $x_s$ is a daughter of $\hat{x}_{s-1}$ in the data structure of $\mathrm{cand}_{\hat{x}_{k:s-1}}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$. Next, for those $x_s \in \mathcal{X}_s$ for which this is indeed the case, we need to check if $\alpha_s^{\max}(x_s) \geq \alpha_k^{\mathrm{opt}}(\hat{x}_{k:s-1} \oplus x_s \mid z_{k-1})$. Checking those two criteria for every $x_s \in \mathcal{X}_s$ will from now on be called performing a search step, and its complexity is linear in the number of states. Those $x_s \in \mathcal{X}_s$ that meet both criteria will be denoted by $\hat{x}_s$ and concatenated with $\hat{x}_{k:s-1}$.

We will now prove that performing such a search step will always yield at least one $\hat{x}_s$ that can be concatenated with $\hat{x}_{k:s-1}$.

Theorem 5. Consider an arbitrary sequence $\hat{x}_{k:s-1}$ that is created while performing the optimal tree construction, with $k \in \{1,\dots,n\}$ and $s \in \{k,\dots,n\}$. Then there is always at least one $x_s \in \mathcal{X}_s$ for which both $\mathrm{cand}_{\hat{x}_{k:s-1} \oplus x_s}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$ is non-empty and the inequality $\alpha_s^{\max}(x_s) \geq \alpha_k^{\mathrm{opt}}(\hat{x}_{k:s-1} \oplus x_s \mid z_{k-1})$ holds.

Example 7. In our visual representations, this means that every green node will always have at least one green child, which implies that all green sequences will have length $n - k + 1$.




The situation depicted above is therefore impossible. ♦ 

Next, notice that every optimal sequence $\hat{x}_{k:n}$ yielded by the optimal tree construction is built up by adding extra states $\hat{x}_s$ to an already constructed sequence $\hat{x}_{k:s-1}$, repeating this for $s$ going from $k$ to $n$. Adding such a state means performing one search step, but Theorem 5 implies that performing a search step also means adding at least one state. Therefore, constructing one maximal sequence $\hat{x}_{k:n}$ will never take more search steps than the length of this sequence. Since performing one search step is linear in the number of states, constructing one maximal sequence is linear in the length of the sequence and the number of states. Determining the set $\mathrm{opt}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$ of all maximal sequences will thus be linear in the number of sequences, in the length of the sequences and in the number of states.

7.3. The recursive construction of the solutions. To construct $\mathrm{opt}(\mathcal{X}_{1:n} \mid o_{1:n})$ recursively, we let $k$ run from $n$ to $1$. For a fixed $k$, we construct the set $\mathrm{opt}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$ for every $z_{k-1} \in \mathcal{X}_{k-1}$, by means of the optimal tree construction. We have already shown that constructing such a set is linear in the number of sequences, the length of the sequences and the number of states. This means that performing this recursive construction is quadratic in the length of the sequences, quadratic in the number of states and, roughly speaking,$^8$ linear in the number of maximal sequences.

$^7$ If $s = k$, we identify $\hat{x}_{k:s-1} = \hat{x}_{k:k-1}$ with a sequence of length zero.

7.4. General complexity. The complete algorithm consists of the preparatory calculations and the recursive construction of the solutions. We conclude that it is quadratic in the number of nodes, cubic in the number of states, and roughly speaking linear in the number of maximal sequences.

7.5. Comparison with Viterbi's algorithm. For precise HMMs, the state sequence es- 
timation problem can be solved very efficiently by the Viterbi algorithm [12, 14], whose 
complexity is linear in the number of nodes, and quadratic in the number of states. However, 
this algorithm only emits a single optimal (most probable) state sequence, even in cases 
where there are multiple (equally probable) optimal solutions: this of course simplifies the 
problem. If we would content ourselves with giving only a single maximal solution, the 
ensuing version of our algorithm would have a complexity that is similar to Viterbi's. 

So, to allow for a fair comparison between Viterbi's algorithm and ours, we would need 
to alter Viterbi's algorithm in such a way that it no longer resolves ties arbitrarily, and emits 
all (equally probable) optimal state sequences. This new version will remain linear in the 
number of nodes, and quadratic in the number of states, but will also have added complexity. 
This can easily be seen by noting that emitting the optimal sequences will be linear in the 
number of them and thus possibly exponential, if all possible solutions would for example 
be equally probable. 
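
For comparison, a version of Viterbi's algorithm that keeps every tie, as described above, can be sketched in Python as follows. This is a plain textbook-style illustration for a precise HMM, not code from the paper; the name `viterbi_all` and the dictionary layout are assumptions. Ties are detected by exact equality, so exact rationals are advisable when ties matter.

```python
def viterbi_all(outputs, states, init, trans, emit):
    """Return (highest joint probability, all state sequences attaining it).

    init[x]     : probability that the first state is x
    trans[x][y] : probability of moving from state x to state y
    emit[x][o]  : probability of emitting output o in state x
    """
    n = len(outputs)
    # best[i][y]: highest joint probability of a state prefix ending in y.
    best = [{x: init[x] * emit[x][outputs[0]] for x in states}]
    back = [{x: [] for x in states}]
    for i in range(1, n):
        best.append({})
        back.append({})
        for y in states:
            scores = {x: best[i - 1][x] * trans[x][y] for x in states}
            top = max(scores.values())
            best[i][y] = top * emit[y][outputs[i]]
            # Keep every predecessor that attains the maximum (no tie-breaking).
            back[i][y] = [x for x in states if scores[x] == top]
    top = max(best[-1].values())

    def paths(i, y):
        if i == 0:
            return [(y,)]
        return [p + (y,) for x in back[i][y] for p in paths(i - 1, x)]

    result = [p for y in states if best[-1][y] == top for p in paths(n - 1, y)]
    return top, result
```

The forward pass stays linear in the number of nodes and quadratic in the number of states; only the final enumeration of tied sequences can blow up.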

For the most time-consuming part of our algorithm (the recursive construction of the solutions), the only difference in complexity is this: Viterbi's approach is linear and ours is quadratic in the number of nodes. Where does this difference come from? In iHMMs we have mutually incomparable solutions, whereas in pHMMs the optimal solutions are indifferent, or equally probable. As a result, the algorithm for pHMMs requires no forward loops, whereas the EstiHMM algorithm needs them when performing the optimal tree construction. We believe that this added complexity is a reasonable price to pay for the robustness that working with imprecise-probabilistic models offers.

8. Some experiments 

While a linear complexity in the number of maximal sequences is probably the best we 
can hope for, we also see that we will only be able to find all maximal sequences efficiently 
provided their number is reasonably small. Should it, say, tend to increase exponentially with 
the length of the chain, then no algorithm, however cleverly designed, could overcome this 
hurdle. Because this number of maximal sequences is so important, we study its behaviour 
in more detail. In order to do so, we take a closer look at how this number of maximal 
sequences depends on the transition probabilities of the model, and how it evolves when we 
let the imprecision of the local models grow. We shall see that this number displays very 
interesting behaviour that can be explained, and even predicted to some extent. To allow 
for easy visualisation, we limit this discussion to binary iHMMs, where both the state and 
output variables can assume only two possible values, say 0 and 1.

8.1. Describing a binary stationary iHMM. We first consider a binary stationary HMM. The (precise) transition probabilities for going from one state to the next are completely determined by two numbers in the unit interval: the probability $p$ to go from state 0 to state 0, and the probability $q$ to go from state 1 to state 0. To further pin down the HMM we also need to specify the (marginal) probability $m$ for the first state to be 0, and the two emission probabilities: the probability $r$ of emitting output 0 from state 0 and the probability $s$ of emitting output 0 from state 1.

$^8$ For every $k$, constructing the set $\mathrm{opt}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$ has linear complexity in the number of maximal elements at that stage.

In this binary case, all coherent imprecise-probabilistic models can be found by contamination: taking convex mixtures of precise models, with mixture coefficient $1 - \epsilon$, and the vacuous model, with mixture coefficient $\epsilon$, leading to a so-called linear-vacuous model [15]. To simplify the analysis, we let the emission model remain precise, and use the same mixture coefficient $\epsilon$ for the marginal and the transition models. As $\epsilon$ ranges from zero to one, we then evolve from a precise HMM towards an iHMM with vacuous marginal and transition models (and precise emission models).
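
For a single state (an event that is neither empty nor the whole state space), the linear-vacuous mixture gives simple closed forms for the lower and upper probability. The Python helper below, whose name `linear_vacuous` is chosen for this illustration, sketches that computation.

```python
def linear_vacuous(p, eps):
    """Lower and upper probability of an event with precise probability p
    after mixing with the vacuous model using coefficient eps.

    Valid for events that are neither empty nor the whole space: the vacuous
    model assigns them lower probability 0 and upper probability 1.
    """
    lower = (1 - eps) * p
    upper = (1 - eps) * p + eps
    return lower, upper
```

Setting `eps` to zero recovers the precise probability, and setting it to one yields the vacuous interval from 0 to 1, matching the evolution described above.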

8.2. Explaining the basic ideas using a chain of length two. We now examine the behaviour of an iHMM of length two, with the following (precise) probabilities fixed:$^9$

$$m = 0.1,\ r = 0.8 \text{ and } s = 0.3.$$

Fixing an output sequence and a value for $\epsilon$, we can use our algorithm to calculate the corresponding numbers of maximal state sequences as $p$ and $q$ range over the unit interval. The results can be represented conveniently in the form of a heat plot. The plots below correspond to the output sequence $o_{1:2} = 01$.

The number of maximal state sequences clearly depends on the transition probabilities $p$ and $q$. In the rather large parts of 'probability space' that are coloured white, we get a single maximal sequence (as we would for HMMs), but there are contiguous regions where we see a higher number appear. In the present example (binary chain of length two), the highest possible number of maximal sequences is of course four. In the dark grey area, there are three maximal sequences, and two in the light grey regions. The plots show what happens when we let $\epsilon$ increase: the grey areas expand and the number of maximal sequences increases. For $\epsilon = 15\%$, we even find a small area (coloured black) where all four possible state sequences are maximal: locally, due to the relatively high imprecision of our local models, we cannot give any useful robust estimates of the state sequence producing the output sequence $o_{1:2} = 01$.

For small $\epsilon$, the areas with more than one maximal state sequence are quite small and seem to resemble strips that narrow down to lines as $\epsilon$ tends to zero. This suggests that we should be able to explain at least qualitatively where these areas come from by looking at compatible precise models: the regions where an iHMM produces different maximal (mutually incomparable) sequences are widened versions of loci of indifference for precise HMMs.

By a locus of indifference, we mean the set of $(p, q)$ that correspond to two given state sequences $x_{1:2}$ and $\hat{x}_{1:2}$ having equal posterior probability:

$$p(x_{1:2} \mid o_{1:2}) = p(\hat{x}_{1:2} \mid o_{1:2}),$$

or, provided that $p(o_{1:2}) > 0$,

$$p(x_{1:2}, o_{1:2}) = p(\hat{x}_{1:2}, o_{1:2}).$$

$^9$ This choice is of course arbitrary. Different values would yield comparable results.



In our example where $o_{1:2} = 01$, we find the following expressions for each of the four possible state sequences:

$$p(00, 01) = m\,r\,(1-r)\,p$$
$$p(01, 01) = m\,r\,(1-s)\,(1-p)$$
$$p(10, 01) = (1-m)\,s\,(1-r)\,q$$
$$p(11, 01) = (1-m)\,s\,(1-s)\,(1-q)$$


By equating any two of these expressions, we express that the corresponding two state 
sequences have an equal posterior probability. Since the resulting equations are a function 
of p and q only, each of these six possible combinations defines a locus of indifference. All 
of them are depicted as lines in the following figure. 

Parts of these loci, depicted in blue (darker and bolder in monochrome versions of this paper), demarcate the three regions where the state sequences 01, 10 and 11 are optimal (have the highest posterior probability).

What happens when the 
transition models become impre- 
cise? Roughly speaking, nearby 
values of the original p and 
q enter the picture, effectively 
turning the loci (lines) of in- 
difference into bands of incom- 
parability: the emergence of re- 
gions with two and more max- 
imal sequences can be seen to 
originate from the loci of indif- 
ference; compare the figure for 
these loci with the heat plots 
given above. 




8.3. Extending the argument to a chain of length three. For a chain of length three, we 
can determine the loci of indifference for precise models in a completely analogous manner. 
If we use the same marginal model and emission model as in the previous example, the 
resulting lines of indifference for the output sequence 000 look as follows. 

[Figure panels: $\epsilon = 2\%$ and $\epsilon = 5\%$]

If we compare this with the visualisation of the number of maximal elements for the same
sequence, the resemblance is again quite striking. Note that in this example, the black
areas correspond to a number of maximal sequences that is at least four.



9. Showing off the algorithm's power 

In order to demonstrate that our algorithm is indeed quite efficient, we let it determine 
the maximal sequences for a random output sequence of length 100. 

We consider the same binary stationary HMM as we presented above, but with the 
following precise marginal and emission probabilities: 

m = 0.1, r = 0.98 and s = 0.01. 

In practical applications, the probability for an output variable to have the same value 
as the corresponding hidden state variable is usually quite high, which explains why we 
have chosen r and s to be close to 1 and to 0, respectively. In contrast with the previous 
experiments, we do not let the transition probabilities vary, but fix them to the following 
values: 

p = 0.6 and q = 0.5. 

The iHMM we use to determine the maximal sequences is then generated by mixing
these precise local models with a vacuous one, using the same mixture coefficient $\epsilon$ for
the marginal, transition and emission models. In Figure 4, we display the five maximal
sequences corresponding to the highlighted output sequence, and $\epsilon = 2\%$. Since the
emission probabilities were chosen to be quite accurate, it is no surprise that the output
sequence itself is one of the maximal sequences. In addition, we have indicated in bold face
the state values that differ from the outputs in the output sequence. We see that the model
represents more indecision about the values of the state variables as we move further away
from the end of the sequence. This is a result of a phenomenon called dilation, which (as has
been noted in another paper [4]) tends to occur when inferences in a credal tree proceed from
the leaves towards the root.
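
To make the mixing step concrete, here is a minimal Python sketch of a linear-vacuous mixture. It assumes the usual convention, in which the precise model receives weight $1-\epsilon$ and the vacuous model weight $\epsilon$; the function names are hypothetical and not taken from the authors' implementation.

```python
def linear_vacuous_bounds(prob, eps):
    """Lower and upper probabilities of singletons for a linear-vacuous mixture.

    prob maps each outcome to its precise probability; eps is the mixture
    coefficient (assumed convention: weight 1 - eps on the precise model).
    """
    lower = {x: (1 - eps) * px for x, px in prob.items()}
    upper = {x: (1 - eps) * px + eps for x, px in prob.items()}
    return lower, upper


def lower_prevision(prob, eps, gamble):
    """Lower prevision of a gamble: (1 - eps) * expectation + eps * minimum."""
    expectation = sum(prob[x] * gamble[x] for x in prob)
    return (1 - eps) * expectation + eps * min(gamble.values())


# Transition probabilities out of state 0 (p = 0.6), mixed with eps = 2%.
lower, upper = linear_vacuous_bounds({0: 0.6, 1: 0.4}, eps=0.02)
print(lower[0], upper[0])
```

With $\epsilon = 0$ the lower and upper probabilities coincide and the precise model is recovered.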

As for the efficiency of our algorithm: it took about 0.2 seconds to calculate these 5
maximal sequences.¹⁰ The reason why this could be done so fast is that the algorithm is linear
in the number of solutions, which in this case is only 5. If we let $\epsilon$ grow to, for
example, 5%, the number of maximal sequences for the same output sequence is 764, and these
can be determined in about 32 seconds. Both runs take roughly 0.04 seconds per maximal
sequence, which is consistent with a complexity that is linear in the number of solutions, and
shows that the algorithm can efficiently calculate the maximal sequences even for long output
sequences.

10. An application in optical character recognition 

As a first and simple toy application, we use the EstiHMM algorithm to try and detect
mistakes in words. A written word is regarded as a hidden sequence $x_{1:n}$, generating an
output sequence $o_{1:n}$ by artificially corrupting the word. In this way, we simulate
observation processes that are not perfectly reliable, such as the output of an Optical
Character Recognition (OCR) device. This leads to observed output sequences that may contain
errors, which we will try and detect. We compare our results with those of the Viterbi
algorithm and show that our algorithm offers a more robust solution.
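
For reference, the precise baseline can be sketched as follows. This is a textbook Viterbi recursion written for illustration, not the authors' code; the dictionary layout (`marginal[x]`, `transition[z][x]`, `emission[x][o]`) is an assumption of this sketch, and ties are resolved arbitrarily by `max`.

```python
def viterbi(outputs, states, marginal, transition, emission):
    """Return a most probable state sequence and its joint probability."""
    # delta[x]: probability of the best partial path ending in state x.
    delta = {x: marginal[x] * emission[x][outputs[0]] for x in states}
    back = []  # back[t][x]: best predecessor of x at step t + 1
    for o in outputs[1:]:
        prev, delta, pointers = delta, {}, {}
        for x in states:
            best = max(states, key=lambda z: prev[z] * transition[z][x])
            pointers[x] = best
            delta[x] = prev[best] * transition[best][x] * emission[x][o]
        back.append(pointers)
    # Trace the best path backwards from the most probable final state.
    last = max(states, key=lambda x: delta[x])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, delta[last]


# Binary HMM with the parameter values of Section 9 (m, r, s, p, q).
marginal = {0: 0.1, 1: 0.9}
transition = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.5, 1: 0.5}}
emission = {0: {0: 0.98, 1: 0.02}, 1: {0: 0.01, 1: 0.99}}
print(viterbi([0, 1], [0, 1], marginal, transition, emission))
```

For long words a log-space version would avoid numerical underflow; the plain products are kept here for readability.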
10.1. Generating the HMM. A local uncertainty model must be identified for each original
and observed letter: a marginal model $\underline{Q}_1$ for the first letter $X_1$ of the
original word, a transition model $\underline{Q}_k(\cdot \mid X_{k-1})$ for the subsequent
letters $X_k$, with $k \in \{2,\dots,n\}$, and an emission model
$\underline{S}_k(\cdot \mid X_k)$ for the observed letters $O_k$, with $k \in \{1,\dots,n\}$.
For the sake of simplicity, we assume stationarity, making the transition and emission models
independent of $k$.

For the identification of the local models of the iHMM, we use the imprecise Dirichlet model
(IDM, [16]). For example, for the marginal model $\underline{Q}_1$, applying the IDM leads to
the following simple identification:

\[
\underline{Q}_1(\{z\}) = \frac{n_z}{s + \sum_{z' \in \mathcal{X}} n_{z'}}
\quad\text{and}\quad
\overline{Q}_1(\{z\}) = \frac{s + n_z}{s + \sum_{z' \in \mathcal{X}} n_{z'}}
\quad\text{for all } z \in \mathcal{X},
\]

where $n_z$ counts the words in the sample text for which the first letter $X_1 = z$ and $s$
is a (positive real) hyperparameter that expresses the degree of caution in the inferences. In
this example, we let $s = 2$. For the transition and emission models, we can proceed
similarly, by counting the transitions of one character to another, respectively in the
original word or during the observation process. In this way we obtain lower and upper
transition and emission probabilities for singletons, which, as pointed out in Section 6.4,
suffice to run the algorithm. Note that if $s$ were chosen to be zero, the local models would
become precise and the EstiHMM algorithm would reduce to the Viterbi algorithm (or a version
of it that does not resolve ties arbitrarily, see Section 7.5).
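
As a small illustration of this identification step, the following Python sketch computes the standard IDM bounds, lower $n_z/(s+N)$ and upper $(s+n_z)/(s+N)$ with $N$ the total count. The function name and the toy counts are hypothetical; only the default $s = 2$ is taken from the text.

```python
def idm_bounds(counts, s=2.0):
    """Lower and upper singleton probabilities under the imprecise Dirichlet model.

    counts maps each category z to its observed count n_z; s is the
    hyperparameter expressing the degree of caution.
    """
    total = sum(counts.values())
    lower = {z: n / (s + total) for z, n in counts.items()}
    upper = {z: (s + n) / (s + total) for z, n in counts.items()}
    return lower, upper


# Toy counts of first letters (illustrative only, not from the sample text).
lower, upper = idm_bounds({"a": 3, "b": 5}, s=2.0)
print(lower, upper)
```

The width of each interval is $s/(s+N)$, so the bounds tighten as more words are counted.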

For the identification of the local models in the precise HMM, we use a similar but now
precise Dirichlet model approach, with a Perks prior that has the same prior strength
$s = 2$. As an example, for the precise marginal model $Q_1$, this leads to the following
simple identification:

\[
Q_1(\{z\}) = \frac{n_z + s/|\mathcal{X}|}{s + \sum_{z' \in \mathcal{X}} n_{z'}}
\quad\text{for all } z \in \mathcal{X},
\]

where $|\mathcal{X}|$ is the number of states.

[FIGURE 4]

¹⁰ Running a Python programme on a 2012 MacBookPro.
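
The precise counterpart can be sketched in the same way. The snippet assumes that the Perks prior spreads the prior strength $s$ evenly over the $|\mathcal{X}|$ states, so that each estimate is $(n_z + s/|\mathcal{X}|)/(s + N)$; the function name and counts are again hypothetical.

```python
def perks_estimate(counts, s=2.0):
    """Precise Dirichlet estimate with a symmetric prior of total strength s."""
    total = sum(counts.values())
    k = len(counts)  # number of states
    return {z: (n + s / k) / (s + total) for z, n in counts.items()}


estimate = perks_estimate({"a": 3, "b": 5}, s=2.0)
print(estimate)  # each value lies between n_z/(s+N) and (s+n_z)/(s+N)
```

Each precise estimate falls inside the corresponding IDM interval for the same $s$, which is what makes the precise and imprecise models comparable.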

10.2. Results. Let us first discuss a specific example of the difference between the actual 
results we obtained using the Viterbi and the EstiHMM algorithms, in order to illustrate 
an important advantage of the latter. OCR software has mistakenly read the Italian word 
QUANTO as OUANTO. Using a precise model, the Viterbi algorithm does not correct this 
mistake, as it suggests that the original correct word is DUANTO. The EstiHMM algorithm 
on the other hand, using an imprecise model, returns CUANTO, DUANTO, FUANTO and 
QUANTO as maximal (undominated) solutions, including the correct one. Of course we
would still have to pick the correct solution out of this set of suggestions (for example by
using a dictionary or a human opinion), but by using the EstiHMM algorithm, we have
managed to reduce the search space from all possible six-letter words to the much smaller
set of four words given above. Notice that the solution of the Viterbi algorithm is included
in the maximal solutions EstiHMM returns. One can easily prove that this will always be 
the case. 

To simulate an OCR device, we have artificially corrupted the first 200 words of the first 
canto of Dante's Divina Commedia, resulting in 137 correctly read words and 63 words 
containing errors. We try and correct these errors using both the EstiHMM and the Viterbi 
algorithm, and compare both approaches. The results are summarised in the following table. 

                                  total number    correct after OCR    wrong after OCR

total number                      200 (100%)      137 (68.5%)          63 (31.5%)

Viterbi
  correct solution                157 (78.5%)     132                  25
  wrong solution                  43 (21.5%)      5                    38

EstiHMM
  correct solution included       172 (86%)       137                  35
  correct solution not included   28 (14%)        0                    28

For the Viterbi algorithm, the main conclusion is that applying it to the output of the OCR
device results in a decreased number of incorrect words. The percentage of correct words rises
from 68.5% to 78.5%. However, the Viterbi algorithm also introduces new errors for 5
correctly read words.

The EstiHMM algorithm manages to suggest the original correct word as one of its
solutions in 86% of the cases. Assuming we are able to detect this correct word, the
percentage of correct words rises from 68.5% to 86% by applying the EstiHMM algorithm,
thereby outperforming the Viterbi algorithm by 7.5 percentage points (almost 10% in relative
terms). Secondly, we also notice that the EstiHMM algorithm has never introduced new errors
in words that were already correct.

Of course, since the EstiHMM algorithm allows for multiple solutions, instead of a single
one, it is no surprise that we manage to increase the number of times we suggest the correct
solution. This would happen even if we added random extra solutions to the solution of
the Viterbi algorithm. Giving extra solutions can only be seen as an improvement if this is 
done smartly. To investigate this, we distinguish between the cases where the EstiHMM 
algorithm returns a single solution, and those where it returns multiple solutions; and look 
at how the Viterbi and EstiHMM algorithms compare in those two cases. 

The EstiHMM algorithm returned a single solution for 155 of the 200 words. As we have 
already mentioned above, this single solution will always coincide with the one given by the 
Viterbi algorithm. The results for the EstiHMM (and Viterbi) algorithms are summarised in 
the following table. 
EstiHMM (single solutions)        total number    correct after OCR    wrong after OCR

total number                      155 (100%)      129 (83.2%)          26 (16.8%)
single correct solution           134 (86.5%)     129                  5
single wrong solution             21 (13.5%)      0                    21

The percentage of words correctly read by the OCR software is now 83.2% instead of the 
global 68.5%. When the result of the EstiHMM algorithm is a single solution, this serves as 
an indication that the word we are trying to correct has a fairly high probability of already 
being correct. We also see that the eventual percentage of correct words is 86.5%, which 
is only a slight improvement over the 83.2% that were already correct before applying the 
algorithms. 

Next, we look at the remaining 45 words, for which the EstiHMM algorithm returns
more than one maximal element. In this case, we do see a significant difference between the
results of the Viterbi and the EstiHMM algorithm, since the Viterbi algorithm never returned
more than one solution.¹¹ The results for both algorithms are listed in the following table.



                                  total number    correct after OCR    wrong after OCR

total number                      45 (100%)       8 (17.8%)            37 (82.2%)

EstiHMM (multiple solutions)
  correct solution included       38 (84.4%)      8                    30
  correct solution not included   7 (15.6%)       0                    7

Viterbi
  correct solution                23 (51.1%)      3                    20
  wrong solution                  22 (48.9%)      5                    17



A first and very important conclusion to be drawn from this table, is that EstiHMM's being 
indecisive serves as a rather strong indication that the word we are applying the algorithm 
to does indeed contain errors: when the EstiHMM algorithm returns multiple solutions, the 
original word has been incorrectly read by the OCR software in 82.2% of cases. 

A second conclusion, related to the first, is that EstiHMM's being indecisive also serves as 
an indication that the result returned by the Viterbi algorithm is less reliable: the percentage 
of correct words after applying the Viterbi algorithm has dropped to 51.1%, in contrast 
with the global percentage of 78.5%. The EstiHMM algorithm, however, still gives the 
correct word as one of its solutions in 84.4% of cases, which is almost as high as its global 
percentage of 86%. If the set given by the EstiHMM algorithm contains the correct solution, 
the Viterbi algorithm manages to pick this correct solution out of the set in 60.5% of cases. 
We see that the EstiHMM algorithm seems to notice that we are dealing with more difficult 
words and therefore gives us multiple solutions, between which it cannot decide. 
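
The percentages quoted in this comparison can be recomputed directly from the counts in the tables. The short Python check below is only a reader's aid; the helper name `pct` is hypothetical.

```python
def pct(part, whole):
    """Percentage of part in whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Counts taken from the tables: all 200 words, and the 45 words for which
# EstiHMM returns more than one maximal sequence.
print(pct(157, 200))  # Viterbi correct, all words
print(pct(172, 200))  # correct word among the EstiHMM solutions, all words
print(pct(23, 45))    # Viterbi correct, multiple-solution words
print(pct(38, 45))    # correct word among the EstiHMM solutions, same words
print(pct(23, 38))    # Viterbi correct, given that the set contains the word
```

The last ratio uses 38 as denominator because the Viterbi solution always belongs to the set returned by EstiHMM, so Viterbi can only be correct when that set contains the correct word.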

We conclude from this experiment that EstiHMM can be usefully applied to make the 
results of the Viterbi algorithm more robust, and to gain an appreciation of where it is 
likely to go wrong. If the EstiHMM algorithm returns multiple solutions, this serves as 
an indication for robustness issues that would occur if we solved the same problem with 
the Viterbi algorithm. In that case, EstiHMM returns multiple solutions, between which it
cannot decide, whereas the Viterbi algorithm will pick one out of this set in a fairly arbitrary
way (depending on the choice of the prior), thereby increasing the number of errors made.
The advantage of our method is that it detects such robustness issues, leaving us with the 
option of solving them in different ways. A first method would be to pick the correct word 
out of the set of possible solutions in some non-arbitrary way. For the current application 
this could be done using a dictionary or a human expert. Another method for dealing with 
robustness issues would be to conclude that we need more data in order to build a better 
model, less sensitive to the choice of the prior. After applying the EstiHMM algorithm again,
using the new model, we could check whether the robustness issues have been satisfactorily
dealt with.

¹¹ In theory, the Viterbi algorithm can return multiple indifferent solutions, but in practice it almost never does.

11. Conclusions 

Interpreting the graphical structure of an imprecise hidden Markov model as a credal 
network under epistemic irrelevance leads to an efficient algorithm for finding the maximal 
(undominated) state sequences for a given output sequence. Preliminary simulations show 
that, even for transition models with non-negligible imprecision, the number of maximal 
elements seems to be reasonably low in fairly large regions of parameter space, with high 
numbers of maximal elements concentrated in fairly small regions. It remains to be seen 
whether this observation can be corroborated by a deeper theoretical analysis. 

A first and simple toy application clearly shows that the EstiHMM algorithm is able to
robustify the results of the Viterbi algorithm. Not only does it reduce the number of wrong
conclusions by giving extra possible solutions, but it does so in an intelligent manner. It
adds extra solutions in the specific cases where the Viterbi algorithm has robustness issues,
thereby also serving as an indicator of the reliability of the result given by the Viterbi
algorithm. An interesting further avenue of research would be to compare the EstiHMM
algorithm with other methods that also try to robustify the Viterbi algorithm. Although most
of these methods start from a precise model and introduce safety rather than imprecision,
for example by trying to find the k most probable solutions, their practical applications are
similar. A comparison of their results with ours could therefore prove to be interesting. We
leave this as a topic of future research.

It is not clear to us, at this point, whether ideas similar to the ones we discussed above 
could be used to derive similarly efficient algorithms for imprecise hidden Markov models 
whose graphical structure is interpreted as a credal network under strong independence [2]. 
This could be interesting and relevant, as the more stringent independence condition leads to 
joint models that are less imprecise, and therefore produce fewer maximal state sequences 
(although they will be contained in our solutions). 
Appendix A. Proofs of main results

In this appendix, we justify the formulas (6), (7), (15), (16), (17), (33) and (34); and
we give proofs for Proposition 1 and Theorems 2-5. We will frequently use terms such
as positive, negative, decreasing and increasing. We therefore start by clarifying what we
mean by them. For $x \in \mathbb{R}$, we say that $x$ is positive if $x > 0$, negative if
$x < 0$, non-negative if $x \ge 0$ and non-positive if $x \le 0$. We call a real-valued
function $f$ defined on $\mathbb{R}$:

(i) increasing if $(\forall x,y \in \mathbb{R})(x > y \Rightarrow f(x) > f(y))$;

(ii) decreasing if $(\forall x,y \in \mathbb{R})(x > y \Rightarrow f(x) < f(y))$;

(iii) non-decreasing if $(\forall x,y \in \mathbb{R})(x > y \Rightarrow f(x) \ge f(y))$;

(iv) non-increasing if $(\forall x,y \in \mathbb{R})(x > y \Rightarrow f(x) \le f(y))$.

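Proof of Equation (6). For all $k \in \{1,\dots,n\}$, $z_{k-1} \in \mathcal{X}_{k-1}$,
$z_{k:n} \in \mathcal{X}_{k:n}$ and $o_{k:n} \in \mathcal{O}_{k:n}$ we infer from Equation (5) that

\[
\begin{aligned}
\underline{P}_k(\mathbb{I}_{\{z_{k:n}\}}\mathbb{I}_{\{o_{k:n}\}} \mid z_{k-1})
&= \underline{Q}_k\big(\underline{E}_k(\mathbb{I}_{\{z_{k:n}\}}\mathbb{I}_{\{o_{k:n}\}} \mid X_k) \,\big|\, z_{k-1}\big) \\
&= \underline{Q}_k\Big(\sum_{x_k \in \mathcal{X}_k} \mathbb{I}_{\{x_k\}}\,
   \underline{E}_k\big(\mathbb{I}_{\{z_k\}}(x_k)\,\mathbb{I}_{\{z_{k+1:n}\}}\mathbb{I}_{\{o_{k:n}\}} \mid x_k\big) \,\Big|\, z_{k-1}\Big) \\
&= \underline{Q}_k\big(\mathbb{I}_{\{z_k\}}\,\underline{E}_k(\mathbb{I}_{\{z_{k+1:n}\}}\mathbb{I}_{\{o_{k:n}\}} \mid z_k) \,\big|\, z_{k-1}\big).
\end{aligned}
\]

Since $\underline{E}_k(\mathbb{I}_{\{z_{k+1:n}\}}\mathbb{I}_{\{o_{k:n}\}} \mid z_k) \ge 0$ by C1, we see that C2 transforms the above into

\[
= \underline{Q}_k(\mathbb{I}_{\{z_k\}} \mid z_{k-1})\,\underline{E}_k(\mathbb{I}_{\{z_{k+1:n}\}}\mathbb{I}_{\{o_{k:n}\}} \mid z_k),
\]

which can be reformulated as

\[
\begin{aligned}
&= \underline{Q}_k(\mathbb{I}_{\{z_k\}} \mid z_{k-1})\,\underline{S}_k(\mathbb{I}_{\{o_k\}} \mid z_k)\,
   \underline{P}_{k+1}(\mathbb{I}_{\{z_{k+1:n}\}}\mathbb{I}_{\{o_{k+1:n}\}} \mid z_k) \\
&= \underline{Q}_k(\{z_k\} \mid z_{k-1})\,\underline{S}_k(\{o_k\} \mid z_k)\,
   \underline{P}_{k+1}(\mathbb{I}_{\{z_{k+1:n}\}}\mathbb{I}_{\{o_{k+1:n}\}} \mid z_k),
\end{aligned}
\]

if we take into account Equation (4), since
$\underline{P}_{k+1}(\mathbb{I}_{\{z_{k+1:n}\}}\mathbb{I}_{\{o_{k+1:n}\}} \mid z_k) \ge 0$ by C1.
Repeating these steps again and again eventually yields Equation (6):

\[
\underline{P}_k(\mathbb{I}_{\{z_{k:n}\}}\mathbb{I}_{\{o_{k:n}\}} \mid z_{k-1})
= \prod_{i=k}^{n} \underline{Q}_i(\{z_i\} \mid z_{i-1})\,\underline{S}_i(\{o_i\} \mid z_i).
\]

In the last step, for $k = n$, we have used the equality
$\underline{E}_n(\{o_n\} \mid z_n) = \underline{S}_n(\{o_n\} \mid z_n)$, which
follows from Equation (3). □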
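Proof of Equation (7). For all $k \in \{1,\dots,n\}$, $z_{k-1} \in \mathcal{X}_{k-1}$,
$z_{k:n} \in \mathcal{X}_{k:n}$ and $o_{k:n} \in \mathcal{O}_{k:n}$ we infer from conjugacy and
Equation (5) that

\[
\begin{aligned}
\overline{P}_k(\mathbb{I}_{\{z_{k:n}\}}\mathbb{I}_{\{o_{k:n}\}} \mid z_{k-1})
&= -\underline{P}_k(-\mathbb{I}_{\{z_{k:n}\}}\mathbb{I}_{\{o_{k:n}\}} \mid z_{k-1}) \\
&= -\underline{Q}_k\big(\underline{E}_k(-\mathbb{I}_{\{z_{k:n}\}}\mathbb{I}_{\{o_{k:n}\}} \mid X_k) \,\big|\, z_{k-1}\big) \\
&= -\underline{Q}_k\big(\mathbb{I}_{\{z_k\}}\,\underline{E}_k(-\mathbb{I}_{\{z_{k+1:n}\}}\mathbb{I}_{\{o_{k:n}\}} \mid z_k) \,\big|\, z_{k-1}\big).
\end{aligned}
\]

Since $-\underline{E}_k(-\mathbb{I}_{\{z_{k+1:n}\}}\mathbb{I}_{\{o_{k:n}\}} \mid z_k)
= \overline{E}_k(\mathbb{I}_{\{z_{k+1:n}\}}\mathbb{I}_{\{o_{k:n}\}} \mid z_k) \ge 0$ by conjugacy and
Lemma 6, we see that C2 and Equation (2) transform the above into

\[
\begin{aligned}
&= -\big(-\underline{E}_k(-\mathbb{I}_{\{z_{k+1:n}\}}\mathbb{I}_{\{o_{k:n}\}} \mid z_k)\big)\,
   \underline{Q}_k(-\mathbb{I}_{\{z_k\}} \mid z_{k-1}) \\
&= -\overline{Q}_k(\mathbb{I}_{\{z_k\}} \mid z_{k-1})\,
   \underline{E}_k(-\mathbb{I}_{\{z_{k+1:n}\}}\mathbb{I}_{\{o_{k:n}\}} \mid z_k),
\end{aligned}
\]

which can be reformulated as

\[
\begin{aligned}
&= -\overline{Q}_k(\mathbb{I}_{\{z_k\}} \mid z_{k-1})\,\overline{S}_k(\mathbb{I}_{\{o_k\}} \mid z_k)\,
   \underline{P}_{k+1}(-\mathbb{I}_{\{z_{k+1:n}\}}\mathbb{I}_{\{o_{k+1:n}\}} \mid z_k) \\
&= \overline{Q}_k(\{z_k\} \mid z_{k-1})\,\overline{S}_k(\{o_k\} \mid z_k)\,
   \overline{P}_{k+1}(\mathbb{I}_{\{z_{k+1:n}\}}\mathbb{I}_{\{o_{k+1:n}\}} \mid z_k),
\end{aligned}
\]

using conjugacy and Equation (4), since
$\underline{P}_{k+1}(-\mathbb{I}_{\{z_{k+1:n}\}}\mathbb{I}_{\{o_{k+1:n}\}} \mid z_k) \le 0$. This last
inequality is true because we know that
$\underline{P}_{k+1}(-\mathbb{I}_{\{z_{k+1:n}\}}\mathbb{I}_{\{o_{k+1:n}\}} \mid z_k)
= -\overline{P}_{k+1}(\mathbb{I}_{\{z_{k+1:n}\}}\mathbb{I}_{\{o_{k+1:n}\}} \mid z_k)$ by conjugacy and that
$\overline{P}_{k+1}(\mathbb{I}_{\{z_{k+1:n}\}}\mathbb{I}_{\{o_{k+1:n}\}} \mid z_k) \ge 0$ by Lemma 6.

Repeating the steps above again and again eventually yields Equation (7):

\[
\overline{P}_k(\mathbb{I}_{\{z_{k:n}\}}\mathbb{I}_{\{o_{k:n}\}} \mid z_{k-1})
= \prod_{i=k}^{n} \overline{Q}_i(\{z_i\} \mid z_{i-1})\,\overline{S}_i(\{o_i\} \mid z_i).
\]

In the last step, for $k = n$, we have used the equality
$\overline{E}_n(\{o_n\} \mid z_n) = \overline{S}_n(\{o_n\} \mid z_n)$, which
follows from Equation (3) and conjugacy. □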
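
Equations (6) and (7) say that the lower (upper) joint probability of a state-output pair of sequences is the product of the local lower (upper) singleton probabilities. A minimal Python sketch for $k = 1$ follows; it assumes that the marginal model plays the role of the first transition model, and the data layout with (lower, upper) pairs is a hypothetical choice made for this illustration.

```python
def joint_bounds(states, outputs, q_first, q_trans, s_emit):
    """Lower and upper joint probability of a state and output sequence.

    q_first[x], q_trans[prev][x] and s_emit[x][o] each hold a pair
    (lower, upper) of singleton probabilities of the local models.
    """
    lower = upper = 1.0
    prev = None
    for x, o in zip(states, outputs):
        q_lo, q_hi = q_first[x] if prev is None else q_trans[prev][x]
        s_lo, s_hi = s_emit[x][o]
        lower *= q_lo * s_lo  # product of lower probabilities, as in (6)
        upper *= q_hi * s_hi  # product of upper probabilities, as in (7)
        prev = x
    return lower, upper


# Illustrative local bounds (not taken from the paper).
q_first = {0: (0.1, 0.2), 1: (0.8, 0.9)}
q_trans = {0: {0: (0.5, 0.7), 1: (0.3, 0.5)}, 1: {0: (0.4, 0.6), 1: (0.4, 0.6)}}
s_emit = {0: {0: (0.9, 1.0), 1: (0.0, 0.1)}, 1: {0: (0.1, 0.2), 1: (0.8, 0.9)}}
print(joint_bounds([0, 1], [0, 1], q_first, q_trans, s_emit))
```

Only singleton bounds of the local models are needed, which matches the remark in Section 10.1 that these suffice to run the algorithm.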

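Lemma 6. Consider a coherent lower prevision $\underline{P}$ on $\mathcal{L}(\mathcal{X})$. Then
$\min f \le \underline{P}(f) \le \overline{P}(f) \le \max f$ for all $f \in \mathcal{L}(\mathcal{X})$
and $\underline{P}(\mu) = \overline{P}(\mu) = \mu$ for all $\mu \in \mathbb{R}$.

Proof. We prove the inequalities in $\min f \le \underline{P}(f) \le \overline{P}(f) \le \max f$ one by
one. The first one is the same as C1. It follows by C3 that
$\underline{P}(f - f) \ge \underline{P}(f) + \underline{P}(-f)$ and, since we know by C2 that
$\underline{P}(0) = 0\,\underline{P}(0) = 0$, this implies that
$\underline{P}(f) \le -\underline{P}(-f) = \overline{P}(f)$, using conjugacy for the last equality.
For the gamble $-f$, C1 yields that $\min(-f) \le \underline{P}(-f)$, which implies that
$\max f = -\min(-f) \ge -\underline{P}(-f) = \overline{P}(f)$.

To conclude, $\underline{P}(\mu) = \overline{P}(\mu) = \mu$ follows by applying these inequalities for
$f = \mu$. □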

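Proof of Proposition 1. Observe that

\[
\overline{P}_k(\{o_{k:n}\} \mid x_{k-1})
= \overline{P}_k\Big(\mathbb{I}_{\{o_{k:n}\}} \sum_{z_{k:n} \in \mathcal{X}_{k:n}} \mathbb{I}_{\{z_{k:n}\}} \,\Big|\, x_{k-1}\Big)
\ge \overline{P}_k(\mathbb{I}_{\{o_{k:n}\}}\mathbb{I}_{\{z^*_{k:n}\}} \mid x_{k-1}) > 0,
\]

where $z^*_{k:n}$ is any element of $\mathcal{X}_{k:n}$. The equality follows from
$\sum_{z_{k:n} \in \mathcal{X}_{k:n}} \mathbb{I}_{\{z_{k:n}\}} = 1$, the first
inequality from Lemma 8(ii), and the second one from the positivity assumption (10) and
Equation (7).

In the same way, we can easily prove that

\[
\overline{E}_k(\{o_{k:n}\} \mid x_k)
= \overline{E}_k\Big(\mathbb{I}_{\{o_{k:n}\}} \sum_{z_{k+1:n} \in \mathcal{X}_{k+1:n}} \mathbb{I}_{\{z_{k+1:n}\}} \,\Big|\, x_k\Big)
\ge \overline{E}_k(\mathbb{I}_{\{o_{k:n}\}}\mathbb{I}_{\{z^*_{k+1:n}\}} \mid x_k) > 0.
\]

This time, we have used the positivity assumption (10) and Equation (9) for the last
inequality. □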

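Proof of Theorem 2. Consider the function $\rho$ defined by
$\rho(\mu) := \underline{P}(\mathbb{I}_{\{o_{1:n}\}}[\mathbb{I}_{\{x_{1:n}\}} - \mathbb{I}_{\{\hat{x}_{1:n}\}} - \mu])$
for all real $\mu$. It follows from Equation (11) that
$\underline{P}(\mathbb{I}_{\{x_{1:n}\}} - \mathbb{I}_{\{\hat{x}_{1:n}\}} \mid o_{1:n})$ is $\rho$'s rightmost
zero, and we also know that
$\rho(0) = \underline{P}(\mathbb{I}_{\{o_{1:n}\}}[\mathbb{I}_{\{x_{1:n}\}} - \mathbb{I}_{\{\hat{x}_{1:n}\}}])$.
$\rho$ is non-increasing and continuous by Lemma 7(i), and has at least one zero by Lemma 7(ii).
Hence, if $\rho(0) > 0$, then $\rho$ has at least one positive zero and
$\underline{P}(\mathbb{I}_{\{x_{1:n}\}} - \mathbb{I}_{\{\hat{x}_{1:n}\}} \mid o_{1:n}) > 0$. If $\rho(0) < 0$,
then $\rho$ has only negative zeroes and
$\underline{P}(\mathbb{I}_{\{x_{1:n}\}} - \mathbb{I}_{\{\hat{x}_{1:n}\}} \mid o_{1:n}) < 0$. Hence, proving the
theorem comes down to proving that $\rho(0) = 0$ implies that $\rho(\epsilon) < 0$ for all
$\epsilon > 0$, since this in turn implies that
$\underline{P}(\mathbb{I}_{\{x_{1:n}\}} - \mathbb{I}_{\{\hat{x}_{1:n}\}} \mid o_{1:n}) = 0$. We now prove this
implication. We consider two different cases.

The case $x_1 = \hat{x}_1$. For any real $\epsilon > 0$:

\[
\begin{aligned}
\rho(\epsilon)
&= \underline{P}(\mathbb{I}_{\{o_{1:n}\}}[\mathbb{I}_{\{x_{1:n}\}} - \mathbb{I}_{\{\hat{x}_{1:n}\}} - \epsilon]) \\
&= \underline{Q}_1\big(\underline{E}_1(\mathbb{I}_{\{o_{1:n}\}}[\mathbb{I}_{\{x_{1:n}\}} - \mathbb{I}_{\{\hat{x}_{1:n}\}} - \epsilon] \mid X_1)\big) \\
&= \underline{Q}_1\Big(\mathbb{I}_{\{x_1\}}\,\underline{E}_1(\mathbb{I}_{\{o_{1:n}\}}[\mathbb{I}_{\{x_{2:n}\}} - \mathbb{I}_{\{\hat{x}_{2:n}\}} - \epsilon] \mid x_1)
   + \sum_{z_1 \neq x_1} \mathbb{I}_{\{z_1\}}\,\underline{E}_1(-\epsilon\,\mathbb{I}_{\{o_{1:n}\}} \mid z_1)\Big). \qquad (41)
\end{aligned}
\]

The coefficients $\underline{E}_1(-\epsilon\,\mathbb{I}_{\{o_{1:n}\}} \mid z_1)$ can be written as
$-\epsilon\,\overline{E}_1(\{o_{1:n}\} \mid z_1)$ by conjugacy and C2, which makes them negative,
decreasing functions of $\epsilon$, since $\overline{E}_1(\{o_{1:n}\} \mid z_1) > 0$ by the
positivity assumption (10) and Proposition 1.

For the coefficient
$\underline{E}_1(\mathbb{I}_{\{o_{1:n}\}}[\mathbb{I}_{\{x_{2:n}\}} - \mathbb{I}_{\{\hat{x}_{2:n}\}} - \epsilon] \mid x_1)$,
we consider two possible cases.

If $\underline{E}_1(\mathbb{I}_{\{o_{1:n}\}}[\mathbb{I}_{\{x_{2:n}\}} - \mathbb{I}_{\{\hat{x}_{2:n}\}}] \mid x_1) > 0$,
we know that
$\underline{E}_1(\mathbb{I}_{\{o_{1:n}\}}[\mathbb{I}_{\{x_{2:n}\}} - \mathbb{I}_{\{\hat{x}_{2:n}\}} - \epsilon] \mid x_1)$
is a decreasing function of $\epsilon$ by Lemma 7(vi). Therefore, the argument of
$\underline{Q}_1$ in Equation (41) decreases pointwise in $\epsilon$, which by Lemma 8(i) implies
that $\rho(\epsilon)$ is a decreasing function of $\epsilon$ and therefore
$\rho(\epsilon) < \rho(0) = 0$.

If, on the other hand,
$\underline{E}_1(\mathbb{I}_{\{o_{1:n}\}}[\mathbb{I}_{\{x_{2:n}\}} - \mathbb{I}_{\{\hat{x}_{2:n}\}}] \mid x_1) \le 0$,
we know by Lemma 8(ii) that
$\underline{E}_1(\mathbb{I}_{\{o_{1:n}\}}[\mathbb{I}_{\{x_{2:n}\}} - \mathbb{I}_{\{\hat{x}_{2:n}\}} - \epsilon] \mid x_1) \le 0$,
implying that

\[
\begin{aligned}
\rho(\epsilon)
&\le \underline{Q}_1\Big(\sum_{z_1 \neq x_1} \mathbb{I}_{\{z_1\}}\,\underline{E}_1(-\epsilon\,\mathbb{I}_{\{o_{1:n}\}} \mid z_1)\Big) \\
&\le \underline{Q}_1\big(\mathbb{I}_{\{z_1^*\}}\,\underline{E}_1(-\epsilon\,\mathbb{I}_{\{o_{1:n}\}} \mid z_1^*)\big)
 = -\epsilon\,\overline{E}_1(\{o_{1:n}\} \mid z_1^*)\,\overline{Q}_1(\{z_1^*\}) < 0.
\end{aligned}
\]

In this expression, $z_1^*$ is an arbitrary $z_1 \neq x_1$. The first two inequalities are due
to Lemma 8(ii). Conjugacy and C2 yield the equality and the last inequality is a consequence of
the positivity assumption (10) and Proposition 1. Also in this case, therefore, we find that
$\rho(\epsilon) < 0$.

The case $x_1 \neq \hat{x}_1$. For any real $\epsilon > 0$:

\[
\begin{aligned}
\rho(\epsilon)
&= \underline{P}(\mathbb{I}_{\{o_{1:n}\}}[\mathbb{I}_{\{x_{1:n}\}} - \mathbb{I}_{\{\hat{x}_{1:n}\}} - \epsilon]) \\
&= \underline{Q}_1\big(\underline{E}_1(\mathbb{I}_{\{o_{1:n}\}}[\mathbb{I}_{\{x_{1:n}\}} - \mathbb{I}_{\{\hat{x}_{1:n}\}} - \epsilon] \mid X_1)\big) \\
&= \underline{Q}_1\Big(\mathbb{I}_{\{x_1\}}\,\underline{E}_1(\mathbb{I}_{\{o_{1:n}\}}[\mathbb{I}_{\{x_{2:n}\}} - \epsilon] \mid x_1)
   + \mathbb{I}_{\{\hat{x}_1\}}\,\underline{E}_1(\mathbb{I}_{\{o_{1:n}\}}[-\mathbb{I}_{\{\hat{x}_{2:n}\}} - \epsilon] \mid \hat{x}_1) \\
&\qquad\quad + \sum_{z_1 \notin \{x_1, \hat{x}_1\}} \mathbb{I}_{\{z_1\}}\,\underline{E}_1(-\epsilon\,\mathbb{I}_{\{o_{1:n}\}} \mid z_1)\Big). \qquad (42)
\end{aligned}
\]

In the proof for the case $x_1 = \hat{x}_1$, we have already shown that the coefficients
$\underline{E}_1(-\epsilon\,\mathbb{I}_{\{o_{1:n}\}} \mid z_1)$ are negative, decreasing functions of
$\epsilon$. Together with Lemma 8(ii), this implies that
$\underline{E}_1(\mathbb{I}_{\{o_{1:n}\}}[-\mathbb{I}_{\{\hat{x}_{2:n}\}} - \epsilon] \mid \hat{x}_1)
\le \underline{E}_1(-\epsilon\,\mathbb{I}_{\{o_{1:n}\}} \mid \hat{x}_1) < 0$, which in turn by
Lemma 7(vii) implies that
$\underline{E}_1(\mathbb{I}_{\{o_{1:n}\}}[-\mathbb{I}_{\{\hat{x}_{2:n}\}} - \epsilon] \mid \hat{x}_1)$ is a
decreasing function of $\epsilon$. All that is left to consider is the coefficient
$\underline{E}_1(\mathbb{I}_{\{o_{1:n}\}}[\mathbb{I}_{\{x_{2:n}\}} - \epsilon] \mid x_1)$. There are two
possibilities.

If $\underline{E}_1(\mathbb{I}_{\{o_{1:n}\}}\mathbb{I}_{\{x_{2:n}\}} \mid x_1) > 0$, then Lemma 7(vi) implies
that $\underline{E}_1(\mathbb{I}_{\{o_{1:n}\}}[\mathbb{I}_{\{x_{2:n}\}} - \epsilon] \mid x_1)$ is a decreasing
function of $\epsilon$. Therefore, the argument of $\underline{Q}_1$ in Equation (42) decreases
pointwise in $\epsilon$, which by Lemma 8(i) implies that $\rho(\epsilon)$ is a decreasing
function of $\epsilon$ and therefore $\rho(\epsilon) < \rho(0) = 0$.

If, on the other hand, $\underline{E}_1(\mathbb{I}_{\{o_{1:n}\}}\mathbb{I}_{\{x_{2:n}\}} \mid x_1) = 0$, then
we know that
$\underline{E}_1(\mathbb{I}_{\{o_{1:n}\}}[\mathbb{I}_{\{x_{2:n}\}} - \epsilon] \mid x_1) \le 0$ by
Lemma 8(ii), implying that

\[
\begin{aligned}
\rho(\epsilon)
&\le \underline{Q}_1\big(\mathbb{I}_{\{\hat{x}_1\}}\,\underline{E}_1(\mathbb{I}_{\{o_{1:n}\}}[-\mathbb{I}_{\{\hat{x}_{2:n}\}} - \epsilon] \mid \hat{x}_1)\big) \\
&\le \underline{Q}_1\big(\mathbb{I}_{\{\hat{x}_1\}}\,\underline{E}_1(-\epsilon\,\mathbb{I}_{\{o_{1:n}\}} \mid \hat{x}_1)\big)
 = -\epsilon\,\overline{E}_1(\{o_{1:n}\} \mid \hat{x}_1)\,\overline{Q}_1(\{\hat{x}_1\}) < 0.
\end{aligned}
\]

The first two inequalities follow from Lemma 8(ii). Conjugacy and C2 yield the equality,
and the last inequality is a consequence of the positivity assumption (10) and Proposition 1.
Also in this case, then, we find that $\rho(\epsilon) < 0$. □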

Lemma 7. Let $\underline{P}$ be a coherent lower prevision on
$\mathcal{L}(\mathcal{X} \times \mathcal{Y})$. For any $f \in \mathcal{L}(\mathcal{X})$ and
$y \in \mathcal{Y}$, consider the real-valued map $\rho$ defined on $\mathbb{R}$ by
$\rho(\mu) := \underline{P}(\mathbb{I}_{\{y\}}[f - \mu])$ for all real $\mu$. Then
the following statements hold:

(i) $\rho$ is non-increasing, concave and continuous.

(ii) $\rho$ has at least one zero.

(iii) If $\underline{P}(\{y\}) > 0$, then $\rho$ is decreasing and has a unique zero.

(iv) If $\overline{P}(\{y\}) = 0$, then $\rho$ is identically zero.

(v) If $\underline{P}(\{y\}) = 0$ and $\overline{P}(\{y\}) > 0$, then $\rho$ is zero on
$(-\infty, \underline{P}(f \mid y)]$, and negative and decreasing on
$(\underline{P}(f \mid y), +\infty)$.

(vi) If $\rho(a) > 0$ for some $a$, then $\rho$ is decreasing and has a unique zero.

(vii) If $\rho$ is negative on an interval $(a,b)$, then it is also decreasing on $(a,b)$.

Proof. We start by proving (i). It follows directly from Lemma 8(ii) that $\rho$ is
non-increasing in $\mu$. Now consider $\mu_1$ and $\mu_2$ in $\mathbb{R}$ and
$0 \le \lambda \le 1$. $\rho$ is concave because

\[
\begin{aligned}
\rho(\lambda\mu_1 + (1-\lambda)\mu_2)
&= \underline{P}(\mathbb{I}_{\{y\}}[f - (\lambda\mu_1 + (1-\lambda)\mu_2)]) \\
&= \underline{P}(\lambda\,\mathbb{I}_{\{y\}}[f - \mu_1] + (1-\lambda)\,\mathbb{I}_{\{y\}}[f - \mu_2]) \\
&\ge \underline{P}(\lambda\,\mathbb{I}_{\{y\}}[f - \mu_1]) + \underline{P}((1-\lambda)\,\mathbb{I}_{\{y\}}[f - \mu_2]) \\
&= \lambda\,\underline{P}(\mathbb{I}_{\{y\}}[f - \mu_1]) + (1-\lambda)\,\underline{P}(\mathbb{I}_{\{y\}}[f - \mu_2]) \\
&= \lambda\rho(\mu_1) + (1-\lambda)\rho(\mu_2),
\end{aligned}
\]

where the inequality follows from C3 and the subsequent step is due to C2. To prove that
$\rho(\mu)$ is continuous, consider any $\mu_1$ and $\mu_2$ in $\mathbb{R}$, then we see that

\[
\begin{aligned}
\rho(\mu_2)
&= \underline{P}(\mathbb{I}_{\{y\}}[f - \mu_2]) = \underline{P}(\mathbb{I}_{\{y\}}[f - \mu_1 + (\mu_1 - \mu_2)]) \\
&= \underline{P}(\mathbb{I}_{\{y\}}[f - \mu_1] + \mathbb{I}_{\{y\}}(\mu_1 - \mu_2))
 \ge \underline{P}(\mathbb{I}_{\{y\}}[f - \mu_1]) + \underline{P}(\mathbb{I}_{\{y\}}(\mu_1 - \mu_2)) \\
&= \rho(\mu_1) - \overline{P}(\{y\}) \odot (\mu_2 - \mu_1),
\end{aligned}
\]

where the inequality follows from C3, and the last equality is due to conjugacy and C2.
Hence $|\rho(\mu_1) - \rho(\mu_2)| \le |\mu_2 - \mu_1|\,\overline{P}(\{y\})$, which proves that
$\rho$ is Lipschitz continuous, and therefore also continuous.

To prove (ii), notice that
$\rho(\min f) = \underline{P}(\mathbb{I}_{\{y\}}[f - \min f]) \ge \underline{P}(\mathbb{I}_{\{y\}}[\min f - \min f]) = 0$
and
$\rho(\max f) = \underline{P}(\mathbb{I}_{\{y\}}[f - \max f]) \le \underline{P}(\mathbb{I}_{\{y\}}[\max f - \max f]) = 0$.
The inequalities are a consequence of Lemma 8(ii), and the last equalities follow from Lemma 6.
Since $\rho(\mu)$ is continuous, this implies the existence of a zero between $\min f$ and
$\max f$.

Property (iii) can be proved by considering $\mu_1$ and $\mu_2$ in $\mathbb{R}$ with
$\mu_2 > \mu_1$. If $\underline{P}(\{y\}) > 0$, we see that $\rho$ is decreasing, since

\[
\begin{aligned}
\rho(\mu_1)
&= \underline{P}(\mathbb{I}_{\{y\}}[f - \mu_1]) = \underline{P}(\mathbb{I}_{\{y\}}[f - \mu_2 + (\mu_2 - \mu_1)]) \\
&= \underline{P}(\mathbb{I}_{\{y\}}[f - \mu_2] + \mathbb{I}_{\{y\}}(\mu_2 - \mu_1))
 \ge \underline{P}(\mathbb{I}_{\{y\}}[f - \mu_2]) + \underline{P}(\mathbb{I}_{\{y\}}(\mu_2 - \mu_1)) \\
&= \rho(\mu_2) + (\mu_2 - \mu_1)\,\underline{P}(\{y\}) > \rho(\mu_2),
\end{aligned}
\]

where the first inequality follows from C3 and the last equality from C2. We know by (ii)
that $\rho$ has at least one zero, which must be unique because $\rho$ is decreasing.

To prove (iv), first note that $\overline{P}(\{y\}) = 0$ also implies
$\underline{P}(\{y\}) = 0$, because of Lemma 6. Now fix $\mu$ in $\mathbb{R}$ and choose $a$ and
$b$ in $\mathbb{R}$ such that

\[ a < \min\{0, \min\{f - \mu\}\} \le \max\{0, \max\{f - \mu\}\} < b. \]

Then at the same time
$\rho(\mu) = \underline{P}(\mathbb{I}_{\{y\}}[f - \mu]) \ge \underline{P}(\mathbb{I}_{\{y\}}a) = a\,\overline{P}(\{y\}) = 0$
and
$\rho(\mu) = \underline{P}(\mathbb{I}_{\{y\}}[f - \mu]) \le \underline{P}(\mathbb{I}_{\{y\}}b) = b\,\underline{P}(\{y\}) = 0$,
using Lemma 8(ii), C2 and conjugacy. We conclude that $\rho(\mu) = 0$ for any $\mu$ in
$\mathbb{R}$.

The proof of (v) starts by noticing that $\rho(\mu) \ge 0$ for
$\mu \in (-\infty, \underline{P}(f \mid y)]$ and $\rho(\mu) < 0$ for
$\mu \in (\underline{P}(f \mid y), +\infty)$, due to the definition of
$\underline{P}(f \mid y)$ (see Equation (11)), and the fact that $\rho$ is non-increasing by (i).
In the proof of (iv), we have already shown that $\rho$ is non-positive if
$\underline{P}(\{y\}) = 0$, which allows us to conclude that $\rho(\mu) = 0$ for
$\mu \in (-\infty, \underline{P}(f \mid y)]$. We are left to prove that $\rho$ is decreasing on
the interval $(\underline{P}(f \mid y), +\infty)$. We will do so by contradiction.
Suppose that $\rho$ is not decreasing on that interval, then there are $\mu_1$ and $\mu_2$ in
this interval, such that $\mu_2 > \mu_1$ and $0 > \rho(\mu_2) \ge \rho(\mu_1)$. Since $\rho$ is
zero on $(-\infty, \underline{P}(f \mid y))$, we can also choose $\mu_0 < \mu_1$ such that
$\rho(\mu_0) = 0$. The existence of such $\mu_0$, $\mu_1$ and $\mu_2$ contradicts the
concavity of $\rho$, established by (i).

To prove (vi), observe that $\overline{P}(\{y\}) \ge \underline{P}(\{y\}) \ge 0$ by Lemma 6. This
implies that the three cases considered in (iii), (iv) and (v) are exhaustive and mutually
exclusive. If there is an $a$ for which $\rho(a) > 0$, we can only have the case considered in
(iii), which implies that $\rho$ is decreasing and has a unique zero.

It now only remains to prove (vii). By repeating the argument in the proof of (vi), we
see that if $\rho$ is negative on an interval $(a,b)$, only the cases considered in (iii) and
(v) can obtain. For (iii), $\rho$ is decreasing on its entire domain. For (v), $\rho$ is
definitely decreasing on $(a,b)$. □
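
A numerical illustration of Lemma 7 may help. The Python sketch below takes the lower envelope of a finite set of joint mass functions as an example of a coherent lower prevision (an assumption made for this illustration, not the paper's local models) and locates the zero of $\rho$ by bisection in the situation of statement (iii); all names are hypothetical.

```python
def rho(mu, credal_set, f, y):
    """Lower envelope of the expectations of I_{y} [f - mu].

    credal_set is a list of joint mass functions, each mapping a pair
    (x, y) to its probability; f maps x to a real number.
    """
    return min(
        sum(p_xy * (f[x] - mu) for (x, y2), p_xy in p.items() if y2 == y)
        for p in credal_set
    )


def find_zero(credal_set, f, y, tol=1e-12):
    """Bisection for the zero of rho, assuming case (iii): P_lower({y}) > 0."""
    lo, hi = min(f.values()), max(f.values())
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if rho(mid, credal_set, f, y) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Two joint mass functions on {0,1} x {0,1}; values chosen for illustration.
p1 = {(0, 0): 0.2, (1, 0): 0.3, (0, 1): 0.4, (1, 1): 0.1}
p2 = {(0, 0): 0.1, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.3}
f = {0: 0.0, 1: 1.0}
print(find_zero([p1, p2], f, y=0))
```

Here $\rho(\mu) = \min\{0.3 - 0.5\mu,\ 0.1 - 0.2\mu\}$, a concave, decreasing, piecewise linear function whose zero is the smaller of the two conditional expectations of $f$ given $y = 0$.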

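Lemma 8. Consider a coherent lower prevision $\underline{P}$ on $\mathcal{L}(\mathcal{X})$ and two
gambles $f, g \in \mathcal{L}(\mathcal{X})$.

(i) If $f(x) > g(x)$ for all $x \in \mathcal{X}$, then $\underline{P}(f) > \underline{P}(g)$.

(ii) If $f(x) \ge g(x)$ for all $x \in \mathcal{X}$, then $\underline{P}(f) \ge \underline{P}(g)$.

Proof. We start with (i). If $f - g$ is pointwise positive, then $\min(f - g) > 0$ and
therefore $\underline{P}(f - g) \ge \min(f - g) > 0$, using C1 for the first inequality. It
follows from C3 that
$\underline{P}(f) = \underline{P}((f - g) + g) \ge \underline{P}(f - g) + \underline{P}(g)$, and
therefore that $\underline{P}(f) - \underline{P}(g) \ge \underline{P}(f - g) > 0$,
whence indeed $\underline{P}(f) > \underline{P}(g)$.

The proof for (ii) is analogous, but now we only have that $\min(f - g) \ge 0$, implying that
$\underline{P}(f) - \underline{P}(g) \ge \underline{P}(f - g) \ge \min(f - g) \ge 0$. □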
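
The bounds of Lemma 6 and the monotonicity of Lemma 8 are easy to check numerically on a small example. The sketch below again uses the lower envelope of a finite set of mass functions as a stand-in for a coherent lower prevision; the function names and numbers are hypothetical.

```python
def lower_prev(credal_set, g):
    """Lower envelope of the expectations of the gamble g."""
    return min(sum(p[x] * g[x] for x in p) for p in credal_set)


def upper_prev(credal_set, g):
    """Conjugate upper prevision: minus the lower prevision of -g."""
    return -lower_prev(credal_set, {x: -v for x, v in g.items()})


credal_set = [{"a": 0.2, "b": 0.8}, {"a": 0.5, "b": 0.5}]
f = {"a": 1.0, "b": 3.0}
g = {"a": 0.0, "b": 2.0}  # g = f - 1, so f > g pointwise

# Lemma 6: min f <= lower <= upper <= max f.
print(min(f.values()), lower_prev(credal_set, f), upper_prev(credal_set, f), max(f.values()))
# Lemma 8(i): f > g pointwise implies a strictly larger lower prevision.
print(lower_prev(credal_set, f) > lower_prev(credal_set, g))
```

The checks only illustrate the statements on one example; the proofs above are what establish them for every coherent lower prevision.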

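Proof of Equation (15). Let
$\Delta[x_{k:n}, \hat{x}_{k:n}] := \mathbb{I}_{\{o_{k:n}\}}[\mathbb{I}_{\{x_{k:n}\}} - \mathbb{I}_{\{\hat{x}_{k:n}\}}]$.
Since $k \in \{1,\dots,n-1\}$ and $x_k = \hat{x}_k$, this implies that

\[
\begin{aligned}
\Delta[x_{k:n}, \hat{x}_{k:n}]
&= \mathbb{I}_{\{o_{k:n}\}}[\mathbb{I}_{\{x_{k:n}\}} - \mathbb{I}_{\{\hat{x}_{k:n}\}}] \\
&= \mathbb{I}_{\{o_k\}}\mathbb{I}_{\{x_k\}}\mathbb{I}_{\{o_{k+1:n}\}}[\mathbb{I}_{\{x_{k+1:n}\}} - \mathbb{I}_{\{\hat{x}_{k+1:n}\}}] \\
&= \mathbb{I}_{\{o_k\}}\mathbb{I}_{\{x_k\}}\Delta[x_{k+1:n}, \hat{x}_{k+1:n}],
\end{aligned}
\]

which in turn implies that

\[
\begin{aligned}
\underline{P}_k(\Delta[x_{k:n}, \hat{x}_{k:n}] \mid z_{k-1})
&= \underline{Q}_k\big(\underline{E}_k(\mathbb{I}_{\{o_k\}}\mathbb{I}_{\{x_k\}}\Delta[x_{k+1:n}, \hat{x}_{k+1:n}] \mid X_k) \,\big|\, z_{k-1}\big) \\
&= \underline{Q}_k\big(\mathbb{I}_{\{x_k\}}\,\underline{E}_k(\mathbb{I}_{\{o_k\}}\Delta[x_{k+1:n}, \hat{x}_{k+1:n}] \mid x_k) \,\big|\, z_{k-1}\big) \\
&= \underline{Q}_k(\{x_k\} \mid z_{k-1}) \odot \underline{E}_k(\mathbb{I}_{\{o_k\}}\Delta[x_{k+1:n}, \hat{x}_{k+1:n}] \mid x_k) \\
&= \underline{Q}_k(\{x_k\} \mid z_{k-1})\,\underline{S}_k(\{o_k\} \mid x_k) \odot \underline{P}_{k+1}(\Delta[x_{k+1:n}, \hat{x}_{k+1:n}] \mid x_k),
\end{aligned}
\]

proving Equation (15). The first equality follows from Equation (5). The second equality
holds because $\mathbb{I}_{\{x_k\}}(z_k) = 0$ for all $z_k \neq x_k$, implying that
$\underline{E}_k(\mathbb{I}_{\{o_k\}}\mathbb{I}_{\{x_k\}}\Delta[x_{k+1:n}, \hat{x}_{k+1:n}] \mid X_k)
= \mathbb{I}_{\{x_k\}}\,\underline{E}_k(\mathbb{I}_{\{o_k\}}\Delta[x_{k+1:n}, \hat{x}_{k+1:n}] \mid x_k)$.
The third equality follows from conjugacy and C2, and the last one follows from Equation (4). □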

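Proof of Equation (16). Since $x_n = \hat{x}_n$, Lemma 6 yields:

\[
\underline{P}_n(\mathbb{I}_{\{o_n\}}[\mathbb{I}_{\{x_n\}} - \mathbb{I}_{\{\hat{x}_n\}}] \mid z_{n-1})
= \underline{P}_n(\mathbb{I}_{\{o_n\}}[\mathbb{I}_{\{\hat{x}_n\}} - \mathbb{I}_{\{\hat{x}_n\}}] \mid z_{n-1})
= \underline{P}_n(0 \mid z_{n-1}) = 0. \qquad □
\]

Proof of Equation (17). If $k \in \{1,\dots,n\}$ and $x_k \neq \hat{x}_k$, then

\[
\begin{aligned}
&\underline{P}_k(\mathbb{I}_{\{o_{k:n}\}}[\mathbb{I}_{\{x_{k:n}\}} - \mathbb{I}_{\{\hat{x}_{k:n}\}}] \mid z_{k-1}) \\
&\quad= \underline{Q}_k\big(\underline{E}_k(\mathbb{I}_{\{o_{k:n}\}}[\mathbb{I}_{\{x_{k:n}\}} - \mathbb{I}_{\{\hat{x}_{k:n}\}}] \mid X_k) \,\big|\, z_{k-1}\big) \\
&\quad= \underline{Q}_k\big(\mathbb{I}_{\{x_k\}}\,\underline{E}_k(\mathbb{I}_{\{o_{k:n}\}}\mathbb{I}_{\{x_{k+1:n}\}} \mid x_k)
   + \mathbb{I}_{\{\hat{x}_k\}}\,\underline{E}_k(-\mathbb{I}_{\{o_{k:n}\}}\mathbb{I}_{\{\hat{x}_{k+1:n}\}} \mid \hat{x}_k) \,\big|\, z_{k-1}\big),
\end{aligned}
\]

proving Equation (17). The reasons why all these equalities hold are analogous to the ones
given in the proof of Equation (15). □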

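Proof of Theorem 3. Fix $k \in \{1,\dots,n-1\}$, $z_{k-1} \in \mathcal{X}_{k-1}$ and
$\hat{x}_{k:n} \in \mathcal{X}_{k:n}$. We assume that
$\hat{x}_{k+1:n} \notin \mathrm{opt}(\mathcal{X}_{k+1:n} \mid \hat{x}_k, o_{k+1:n})$ and then show that
$\hat{x}_{k:n} \notin \mathrm{opt}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$. It follows
from the assumption that
$\underline{P}_{k+1}(\mathbb{I}_{\{o_{k+1:n}\}}[\mathbb{I}_{\{x_{k+1:n}\}} - \mathbb{I}_{\{\hat{x}_{k+1:n}\}}] \mid \hat{x}_k) > 0$
for some $x_{k+1:n} \in \mathcal{X}_{k+1:n}$. Now prefix this state sequence $x_{k+1:n}$ with the
state $\hat{x}_k$ to form the state sequence $x_{k:n}$, implying that $x_k = \hat{x}_k$. We then
infer from Equation (15) that

\[
\begin{aligned}
&\underline{P}_k(\mathbb{I}_{\{o_{k:n}\}}[\mathbb{I}_{\{x_{k:n}\}} - \mathbb{I}_{\{\hat{x}_{k:n}\}}] \mid z_{k-1}) \\
&\quad= \underline{Q}_k(\{\hat{x}_k\} \mid z_{k-1})\,\underline{S}_k(\{o_k\} \mid \hat{x}_k)\,
   \underline{P}_{k+1}(\mathbb{I}_{\{o_{k+1:n}\}}[\mathbb{I}_{\{x_{k+1:n}\}} - \mathbb{I}_{\{\hat{x}_{k+1:n}\}}] \mid \hat{x}_k) > 0,
\end{aligned}
\]

which tells us that indeed
$\hat{x}_{k:n} \notin \mathrm{opt}(\mathcal{X}_{k:n} \mid z_{k-1}, o_{k:n})$. □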

Proof of Equations (33) anc/ (34). First, we consider k — n. For every z n -\ G SE n -\, we 
determine opt ( J^, |z„_ i , o„) as the set of those elements x„ of X n for which 

(vx„ e j;\{x„})e n (i {;c „ } j3r x (^)-i { i„}«(-««)k«-i) < o- 

as this condition is equivalent to the optimality condition (14) for k — n, taking into account 
Equations (16), (17) and (31). We now show that this condition is also equivalent to 

(Vx„ G X n \ {x„})a(x„) > /3r x (x„)0„(x„,x„|z„_i), (43) 

To see this, we consider two different cases. For those x n for which /3™ ax (x„) = 0, the 
inequalities 2 n (I {jCn} j3 M max (x„) -I {x - n] a(x n )\z„-i) < and a(x n ) > j3„ max (x„)0„(x„,x„|z„_ 1 ) 
are both trivially satisfied since a(x„) = S n ({o n }\x n ) > by the positivity assumption (10). 
If j3™ ax (x„) > 0, both inequalities are equivalent because of C2 and Equation (27): 



e„(I N} /r x (x n ) - l {Xn} a{x n )\ Zn -,) <0^Q n (l {Xn} I {i „ } 0^ 

> Q n {x n ,x n \zn— l) 



Zn-l < 



AT X (*«) 

& a(x n ) > j3„ raax (x„)0„(x„,x„|z„_i). 

Using Equation (32), Equation (43) can now be reformulated as a(x n ) > a° pt (x„|z„_i), 
which completes the proof of the equivalence. 

Next, we consider any $k \in \{1,\dots,n-1\}$. Fix $z_{k-1} \in \mathcal{X}_{k-1}$, then we must determine
$\mathrm{opt}(\mathcal{X}_{k:n}|z_{k-1},o_{k:n})$. We know from the Principle of Optimality (22) that we can limit the
candidate optimal sequences $\hat{x}_{k:n}$ to the set $\mathrm{cand}(\mathcal{X}_{k:n}|z_{k-1},o_{k:n})$. Consider any such $\hat{x}_{k:n}$,
then we must check for any $x_{k:n} \in \mathcal{X}_{k:n}$ whether $\underline{P}_k(I_{\{o_{k:n}\}}[I_{\{x_{k:n}\}} - I_{\{\hat{x}_{k:n}\}}]|z_{k-1}) \le 0$; see
Equation (14).
If $x_{k:n}$ is such that $x_k = \hat{x}_k$, this inequality is automatically satisfied. Indeed, if $\hat{x}_k \notin
\mathrm{Pos}_k(z_{k-1})$, then we infer from Equation (24) that $\underline{Q}_k(\{\hat{x}_k\}|z_{k-1}) = 0$ or $\underline{S}_k(\{o_k\}|\hat{x}_k) = 0$,
and then Equation (15) tells us that $\underline{P}_k(I_{\{o_{k:n}\}}[I_{\{x_{k:n}\}} - I_{\{\hat{x}_{k:n}\}}]|z_{k-1}) = 0$. If $\hat{x}_k \in \mathrm{Pos}_k(z_{k-1})$,
we know from Equation (23) that $\hat{x}_{k+1:n} \in \mathrm{opt}(\mathcal{X}_{k+1:n}|\hat{x}_k,o_{k+1:n})$, which implies that

$$\underline{P}_k(I_{\{o_{k:n}\}}[I_{\{x_{k:n}\}} - I_{\{\hat{x}_{k:n}\}}]|z_{k-1}) \le 0$$

by Equation (15).

This means we can limit ourselves to checking the inequality for those $x_{k:n}$ for which
$x_k \neq \hat{x}_k$. So fix any $x_k \neq \hat{x}_k$, then we must check whether

$$(\forall x_{k+1:n} \in \mathcal{X}_{k+1:n})\; \underline{Q}_k(I_{\{x_k\}}\beta(x_{k:n}) - I_{\{\hat{x}_k\}}\alpha(\hat{x}_{k:n})|z_{k-1}) \le 0;$$

see Equation (17). By Equation (28) and Lemma 8, this is equivalent to

$$\underline{Q}_k(I_{\{x_k\}}\beta_k^{\max}(x_k) - I_{\{\hat{x}_k\}}\alpha(\hat{x}_{k:n})|z_{k-1}) \le 0,$$

which can in turn be seen to be equivalent to $\alpha(\hat{x}_{k:n}) \ge \beta_k^{\max}(x_k)\,\theta_k(\hat{x}_k,x_k|z_{k-1})$, using a
course of reasoning completely analogous to the one used above for the case $k = n$. Since
this inequality must hold for every $x_k \neq \hat{x}_k$, we infer from Equation (32) that we must
have that $\alpha(\hat{x}_{k:n}) \ge \alpha_k^{\mathrm{opt}}(\hat{x}_k|z_{k-1})$. So we must check this condition for all the candidate
sequences $\hat{x}_{k:n}$ in $\mathrm{cand}(\mathcal{X}_{k:n}|z_{k-1},o_{k:n})$, which proves Equation (33). □

Proof of Theorem 4. This proof consists of two parts. We will first prove that every sequence
$\hat{x}_{k:n}$ obtained by the optimal tree construction is an element of $\mathrm{opt}(\mathcal{X}_{k:n}|z_{k-1},o_{k:n})$. Second,
we will prove that a sequence $z_{k:n}$ that is not part of the set of sequences obtained by the
optimal tree construction cannot be an element of $\mathrm{opt}(\mathcal{X}_{k:n}|z_{k-1},o_{k:n})$.

Let us start by proving that every sequence $\hat{x}_{k:n}$ obtained by the optimal tree construction
is an element of $\mathrm{opt}(\mathcal{X}_{k:n}|z_{k-1},o_{k:n})$. It follows from the last step of the optimal tree
construction that every $\hat{x}_{k:n}$ of the constructed set is an element of $\mathrm{cand}_{\hat{x}_{k:n}}(\mathcal{X}_{k:n}|z_{k-1},o_{k:n})$,
and therefore by Equation (26) also of $\mathrm{cand}(\mathcal{X}_{k:n}|z_{k-1},o_{k:n})$. This last step also implies that
$\alpha_n^{\max}(\hat{x}_n) \ge \alpha_n^{\mathrm{opt}}(\hat{x}_{k:n}|z_{k-1})$, which can be seen to be equivalent with $\alpha(\hat{x}_{k:n}) \ge \alpha_k^{\mathrm{opt}}(\hat{x}_k|z_{k-1})$,
by Equation (31) and the repeated use of Equations (35) and (20). It then follows from
Equation (33) that $\hat{x}_{k:n}$ is an element of $\mathrm{opt}(\mathcal{X}_{k:n}|z_{k-1},o_{k:n})$.

To conclude, we show that a sequence $z_{k:n}$ that is not part of the set of sequences obtained
by the optimal tree construction cannot be an element of $\mathrm{opt}(\mathcal{X}_{k:n}|z_{k-1},o_{k:n})$. If a sequence
$z_{k:n}$ is not part of the set of sequences obtained by the optimal tree construction, this either
implies that it is not an element of $\mathrm{cand}(\mathcal{X}_{k:n}|z_{k-1},o_{k:n})$, or that there is some $s \in \{k,\dots,n\}$
for which $\alpha_s^{\max}(z_s) < \alpha_s^{\mathrm{opt}}(z_{k:s}|z_{k-1})$. In the first case, it follows directly from Equation (33)
that $z_{k:n}$ cannot be an element of $\mathrm{opt}(\mathcal{X}_{k:n}|z_{k-1},o_{k:n})$. In the second case, we see that
$\alpha_s^{\max}(z_s) < \alpha_s^{\mathrm{opt}}(z_{k:s}|z_{k-1})$ implies that $\alpha(z_{s:n}) < \alpha_s^{\mathrm{opt}}(z_{k:s}|z_{k-1})$, which can be seen to be
equivalent with $\alpha(z_{k:n}) < \alpha_k^{\mathrm{opt}}(z_k|z_{k-1})$ by the repeated use of Equations (35) and (20). It
then follows from Equation (33) that $z_{k:n}$ cannot be an element of $\mathrm{opt}(\mathcal{X}_{k:n}|z_{k-1},o_{k:n})$. □

Proof of Theorem 5. If $s = k$, this can be proved by contradiction. If no $x_s \in \mathcal{X}_s$ fulfilled
both conditions, the optimal tree construction would stop and the set
$\mathrm{opt}(\mathcal{X}_{k:n}|z_{k-1},o_{k:n})$ would be empty. This is a contradiction since every non-empty finite partially
ordered set has at least one maximal element.
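
The order-theoretic fact used in this step can be illustrated with a small sketch. It assumes a strict partial order supplied as a hypothetical predicate `dominates(a, b)`; the function names `maximal_elements` and `componentwise_greater` and the example data are illustrative only and do not come from the iHMM setting.

```python
def maximal_elements(items, dominates):
    """Return the items that no other item strictly dominates.

    `dominates(a, b)` is assumed to be a strict partial order
    (irreflexive and transitive); this is an assumption of the sketch.
    For a non-empty finite input the result is never empty.
    """
    items = list(items)
    return [b for b in items if not any(dominates(a, b) for a in items)]


def componentwise_greater(a, b):
    """Example strict partial order on equal-length tuples."""
    return a != b and all(u >= v for u, v in zip(a, b))


points = [(1, 3), (2, 2), (3, 1), (1, 1), (2, 1)]
maximal = maximal_elements(points, componentwise_greater)
```

Here three mutually incomparable points remain, which mirrors the situation where several state sequences are maximal at once.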

Now consider any $s \in \{k+1,\dots,n\}$. Equation (28) implies that there is at least one
sequence $x^*_{s:n} \in \mathcal{X}_{s:n}$ for which $\alpha_{s-1}(\hat{x}_{s-1} \oplus x^*_{s:n}) = \alpha_{s-1}^{\max}(\hat{x}_{s-1})$. We prove that the first state
$x^*_s$ of this sequence meets both criteria of the theorem.

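We know that $\hat{x}_{k:s-1}$ is found using the optimal tree construction, which implies that
$\mathrm{cand}_{\hat{x}_{k:s-1}}(\mathcal{X}_{k:n}|z_{k-1},o_{k:n})$ is a non-empty set and $\alpha_{s-1}^{\max}(\hat{x}_{s-1}) \ge \alpha_{s-1}^{\mathrm{opt}}(\hat{x}_{k:s-1}|z_{k-1})$. It follows
from this inequality that $\alpha_{s-1}(\hat{x}_{s-1} \oplus x^*_{s:n}) \ge \alpha_{s-1}^{\mathrm{opt}}(\hat{x}_{k:s-1}|z_{k-1})$, which can be seen to be
equivalent with $\alpha_s(x^*_{s:n}) \ge \alpha_s^{\mathrm{opt}}(\hat{x}_{k:s-1} \oplus x^*_s|z_{k-1})$ by Equations (20) and (35). Since we know
that $\alpha_s(x^*_{s:n}) = \alpha_s^{\max}(x^*_s)$ by Equation (29), we find that $\alpha_s^{\max}(x^*_s) \ge \alpha_s^{\mathrm{opt}}(\hat{x}_{k:s-1} \oplus x^*_s|z_{k-1})$,
meaning that $x^*_s$ satisfies the first criterion.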
To prove that the state $x^*_s$ also satisfies the second criterion, which means that the
set $\mathrm{cand}_{\hat{x}_{k:s-1} \oplus x^*_s}(\mathcal{X}_{k:n}|z_{k-1},o_{k:n})$ is non-empty, it suffices by Equation (26) to prove that
$\hat{x}_{k:s-1} \oplus x^*_{s:n}$ is an element of $\mathrm{cand}(\mathcal{X}_{k:n}|z_{k-1},o_{k:n})$.

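Since $\mathrm{cand}_{\hat{x}_{k:s-1}}(\mathcal{X}_{k:n}|z_{k-1},o_{k:n})$ is non-empty, there is at least one $z_{s:n} \in \mathcal{X}_{s:n}$ for which
$\hat{x}_{k:s-1} \oplus z_{s:n}$ is an element of $\mathrm{cand}(\mathcal{X}_{k:n}|z_{k-1},o_{k:n})$. Furthermore, we have chosen $x^*_{s:n}$ such
that $\alpha_{s-1}(\hat{x}_{s-1} \oplus x^*_{s:n}) = \alpha_{s-1}^{\max}(\hat{x}_{s-1})$. Lemma 9 now implies that $\hat{x}_{k:s-1} \oplus x^*_{s:n}$ is an element
of $\mathrm{cand}(\mathcal{X}_{k:n}|z_{k-1},o_{k:n})$. □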

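Lemma 9. Fix $k \in \{1,\dots,n-1\}$, $s \in \{k+1,\dots,n\}$, $z_{k-1} \in \mathcal{X}_{k-1}$ and $\hat{x}_{k:s-1} \in \mathcal{X}_{k:s-1}$.
Choose an arbitrary $x^*_{s:n} \in \mathcal{X}_{s:n}$ for which $\alpha_{s-1}(\hat{x}_{s-1} \oplus x^*_{s:n}) = \alpha_{s-1}^{\max}(\hat{x}_{s-1})$. If there is some
$z_{s:n} \in \mathcal{X}_{s:n}$ for which $\hat{x}_{k:s-1} \oplus z_{s:n}$ belongs to $\mathrm{cand}(\mathcal{X}_{k:n}|z_{k-1},o_{k:n})$, then $\hat{x}_{k:s-1} \oplus x^*_{s:n}$ belongs
to $\mathrm{cand}(\mathcal{X}_{k:n}|z_{k-1},o_{k:n})$.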

Proof. To simplify the notation in this proof, it is convenient to use $\hat{x}_{k-1}$ as an alternative
notation for $z_{k-1}$. So from now on $\hat{x}_{k-1} = z_{k-1}$.

It follows by Lemma 10 that $x^*_{s:n} \in \mathrm{opt}(\mathcal{X}_{s:n}|\hat{x}_{s-1},o_{s:n})$. Together with Equation (23), this
implies that $\hat{x}_{s-1} \oplus x^*_{s:n} \in \mathrm{cand}(\mathcal{X}_{s-1:n}|\hat{x}_{s-2},o_{s-1:n})$. If $s = k+1$, this concludes the proof.
If $s \in \{k+2,\dots,n\}$, consider all $q \in \{k,\dots,s-1\}$ and check if there is some $q$ for which
$\hat{x}_q \notin \mathrm{Pos}_q(\hat{x}_{q-1})$ (see definition (24)). If such a $q$ exists, denote the lowest $q \in \{k,\dots,s-2\}$
for which this is the case as $q^*$. By Equation (23) we know that $\hat{x}_{q^*:s-1} \oplus x^*_{s:n}$ and $\hat{x}_{q^*:s-1} \oplus z_{s:n}$
are both elements of $\mathrm{cand}(\mathcal{X}_{q^*:n}|\hat{x}_{q^*-1},o_{q^*:n})$, since all sequences in the set $\hat{x}_{q^*} \oplus \mathcal{X}_{q^*+1:n}$
belong to $\mathrm{cand}(\mathcal{X}_{q^*:n}|\hat{x}_{q^*-1},o_{q^*:n})$.

If no $q \in \{k,\dots,s-2\}$ exists for which $\hat{x}_q \notin \mathrm{Pos}_q(\hat{x}_{q-1})$, we choose $q^* := s-1$. It
then follows by the repeated use of Equations (22) and (23) that $\hat{x}_{s-1} \oplus z_{s:n}$ belongs to
$\mathrm{cand}(\mathcal{X}_{s-1:n}|\hat{x}_{s-2},o_{s-1:n})$ and we already know that $\hat{x}_{s-1} \oplus x^*_{s:n} \in \mathrm{cand}(\mathcal{X}_{s-1:n}|\hat{x}_{s-2},o_{s-1:n})$.

We now have a $q^* \in \{k,\dots,s-1\}$ for which both $\hat{x}_{q^*:s-1} \oplus x^*_{s:n}$ and $\hat{x}_{q^*:s-1} \oplus z_{s:n}$ belong to
$\mathrm{cand}(\mathcal{X}_{q^*:n}|\hat{x}_{q^*-1},o_{q^*:n})$ and for which it holds that $\hat{x}_q \in \mathrm{Pos}_q(\hat{x}_{q-1})$ for all $q \in \{k,\dots,q^*-1\}$.
If $q^* = k$, this concludes the proof.

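If $q^* \in \{k+1,\dots,s-1\}$, notice that $\mathrm{cand}(\mathcal{X}_{k:n}|z_{k-1},o_{k:n})$ is built up by repeatedly using
Equations (33) and (23). We also know that $\hat{x}_{q^*:s-1} \oplus z_{s:n} \in \mathrm{cand}(\mathcal{X}_{q^*:n}|\hat{x}_{q^*-1},o_{q^*:n})$ and it
is given that $\hat{x}_{k:s-1} \oplus z_{s:n}$ belongs to $\mathrm{cand}(\mathcal{X}_{k:n}|z_{k-1},o_{k:n})$, which implies that

$$\alpha_q(\hat{x}_{q:s-1} \oplus z_{s:n}) \ge \alpha_q^{\mathrm{opt}}(\hat{x}_q|\hat{x}_{q-1}) \text{ for all } q \in \{k,\dots,q^*-1\}.$$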

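Furthermore, $\alpha_{s-1}(\hat{x}_{s-1} \oplus x^*_{s:n}) = \alpha_{s-1}^{\max}(\hat{x}_{s-1})$, so $\alpha_{s-1}(\hat{x}_{s-1} \oplus x^*_{s:n}) \ge \alpha_{s-1}(\hat{x}_{s-1} \oplus z_{s:n})$ by
Equation (28). Equation (20) then tells us that

$$\alpha_t(\hat{x}_{t:s-1} \oplus x^*_{s:n}) \ge \alpha_t(\hat{x}_{t:s-1} \oplus z_{s:n}) \text{ for all } t \in \{k,\dots,s-1\},$$

so we know that

$$\alpha_q(\hat{x}_{q:s-1} \oplus x^*_{s:n}) \ge \alpha_q^{\mathrm{opt}}(\hat{x}_q|\hat{x}_{q-1}) \text{ for all } q \in \{k,\dots,q^*-1\}.$$

This implies (since $\mathrm{cand}(\mathcal{X}_{k:n}|z_{k-1},o_{k:n})$ is built up by repeatedly using Equations (33)
and (23) and because $\hat{x}_{q^*:s-1} \oplus x^*_{s:n}$ is an element of $\mathrm{cand}(\mathcal{X}_{q^*:n}|\hat{x}_{q^*-1},o_{q^*:n})$) that the
sequence $\hat{x}_{k:s-1} \oplus x^*_{s:n}$ belongs to $\mathrm{cand}(\mathcal{X}_{k:n}|z_{k-1},o_{k:n})$, which concludes the proof. □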

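Lemma 10. Consider any $s \in \{1,\dots,n\}$, $\hat{x}_{s-1} \in \mathcal{X}_{s-1}$ and $x^*_{s:n} \in \mathcal{X}_{s:n}$. If $\alpha_{s-1}(\hat{x}_{s-1} \oplus
x^*_{s:n}) = \alpha_{s-1}^{\max}(\hat{x}_{s-1})$, then $x^*_{s:n} \in \mathrm{opt}(\mathcal{X}_{s:n}|\hat{x}_{s-1},o_{s:n})$.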

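Proof. If $\alpha_{s-1}(\hat{x}_{s-1} \oplus x^*_{s:n}) = \alpha_{s-1}^{\max}(\hat{x}_{s-1})$, then we know by Equation (28) that

$$\alpha_{s-1}(\hat{x}_{s-1} \oplus x^*_{s:n}) \ge \alpha_{s-1}(\hat{x}_{s-1} \oplus z_{s:n}) \text{ for all } z_{s:n} \in \mathcal{X}_{s:n},$$

and therefore by Equations (19) and (7) that

$$\overline{S}_{s-1}(\{o_{s-1}\}|\hat{x}_{s-1})\,\overline{P}_s(I_{\{x^*_{s:n}\}}I_{\{o_{s:n}\}}|\hat{x}_{s-1}) \ge \overline{S}_{s-1}(\{o_{s-1}\}|\hat{x}_{s-1})\,\overline{P}_s(I_{\{z_{s:n}\}}I_{\{o_{s:n}\}}|\hat{x}_{s-1}).$$

Together with the positivity assumption (10), this implies that

$$\overline{P}_s(I_{\{x^*_{s:n}\}}I_{\{o_{s:n}\}}|\hat{x}_{s-1}) \ge \overline{P}_s(I_{\{z_{s:n}\}}I_{\{o_{s:n}\}}|\hat{x}_{s-1}) \text{ for all } z_{s:n} \in \mathcal{X}_{s:n}. \tag{44}$$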
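We know by C3 that

$$\underline{P}_s(-I_{\{x^*_{s:n}\}}I_{\{o_{s:n}\}}|\hat{x}_{s-1}) \ge \underline{P}_s(I_{\{o_{s:n}\}}(I_{\{z_{s:n}\}} - I_{\{x^*_{s:n}\}})|\hat{x}_{s-1}) + \underline{P}_s(-I_{\{z_{s:n}\}}I_{\{o_{s:n}\}}|\hat{x}_{s-1}),$$

which by conjugacy implies that

$$\underline{P}_s(I_{\{o_{s:n}\}}(I_{\{z_{s:n}\}} - I_{\{x^*_{s:n}\}})|\hat{x}_{s-1}) \le \overline{P}_s(I_{\{z_{s:n}\}}I_{\{o_{s:n}\}}|\hat{x}_{s-1}) - \overline{P}_s(I_{\{x^*_{s:n}\}}I_{\{o_{s:n}\}}|\hat{x}_{s-1}).$$

Using Equation (44), we see that $\underline{P}_s(I_{\{o_{s:n}\}}(I_{\{z_{s:n}\}} - I_{\{x^*_{s:n}\}})|\hat{x}_{s-1}) \le 0$ for all $z_{s:n} \in \mathcal{X}_{s:n}$, which
concludes the proof, since this means that $x^*_{s:n} \in \mathrm{opt}(\mathcal{X}_{s:n}|\hat{x}_{s-1},o_{s:n})$ by Equation (14). □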
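
The bound obtained from superadditivity and conjugacy, namely that the lower prevision of a difference $f - g$ never exceeds the difference of the upper previsions of $f$ and $g$, can be checked numerically. The sketch below is an illustration only: it assumes lower previsions that are lower envelopes of finitely many probability mass functions, and the names `lower`, `upper`, `random_pmf` and `bound_holds` are hypothetical.

```python
import random


def lower(pmfs, f):
    """Lower envelope: minimum expectation of the gamble f over the pmfs."""
    return min(sum(p_x * f_x for p_x, f_x in zip(p, f)) for p in pmfs)


def upper(pmfs, f):
    """Conjugate upper prevision: upper(f) = -lower(-f)."""
    return -lower(pmfs, [-v for v in f])


def random_pmf(rng, size):
    """Draw a strictly positive probability mass function of given size."""
    weights = [rng.random() + 1e-9 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def bound_holds(pmfs, f, g, tol=1e-12):
    """Check lower(f - g) <= upper(f) - upper(g), up to rounding error."""
    diff = [a - b for a, b in zip(f, g)]
    return lower(pmfs, diff) <= upper(pmfs, f) - upper(pmfs, g) + tol


rng = random.Random(0)  # fixed seed so the check is reproducible
all_ok = all(
    bound_holds(
        [random_pmf(rng, 4) for _ in range(3)],
        [rng.random() for _ in range(4)],
        [rng.random() for _ in range(4)],
    )
    for _ in range(200)
)
```

In the proof, $f$ plays the role of $I_{\{z_{s:n}\}}I_{\{o_{s:n}\}}$ and $g$ that of $I_{\{x^*_{s:n}\}}I_{\{o_{s:n}\}}$, so that Equation (44) makes the right-hand side non-positive.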



